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7.1  Introduction: 


The  RPI  task  has  been  concerned  with  the  development  of  expert  systems  techniques 
for  automated  photointerpretation.  More  specifically,  our  efforts  have  been  directed  toward 
the  development,  implementation  and  demonstration  of  techniques  which  will  mimic  the 
job  of  a  trained  photoanalyst  in  interpreting  objects  in  monochrome,  single-frame  aerial 
images.  This  is  a  difficult  task  which  requires  a  combination  of  numerical  and  symbolic 
image  processing  techniques. 

During  the  course  of  this  effort  we  have  developed  a  novel  hierarchial,  region-based 
approach  to  automated  photointerpretation  (cf.  [1]).  Basically,  this  approach  proceeds 
by  first  segmenting  the  input  image  into  disjoint  regions  which  differ  in  tonal  or  textural 
properties.  The  spatial  relationships  between  different  regions  are  then  expressed  in  terms 
of  the  associated  adjacency  graph  where  nodes  represent  regions  and  the  connectivity 
indicates  regions  which  are  spatially  contiguous.  Based  upon  knowledge  of  the  underlying 
spatial  adjacency  graph,  together  with  various  self  and  mutual  region  attributes  or  features, 
the  problem  is  then  that  of  assigning  interpretations,  or  object  categroies,  to  each  of  the 
nodes.  This  is  generally  a  computationally  explosive  task.  The  novelty  of  our  approach 
is  that  we  have  been  able  to  develop  a  computationally  feasible  approach  to  this  symbolic 
interpretation  process. 

The  advantage  of  our  approach  is  based  upon  two  important  properties:  First,  we 
model  the  interpretation  process  as  a  Markov  random  field  (MRF)  defined  on  the  adjacency 
graph.  Secondly,  we  make  use  of  an  efficient  stochastic  relaxation  process  to  find  the 
most  likely  interpretation.  The  first  assumption  allows  us  to  localize  the  search  for  good 
interpretations  while  the  second  helps  in  avoiding  the  otherwise  computationally  explosive 
nature  of  the  search  for  optimum  interpretations. 

Our  mi  jor  effort  during  FY’88  has  been  in  refining  the  region  hierarchial  approach, 
improving  the  initial  segmentation  process  and,  finally,  demonstrating  the  approach  on 
real-world  aerial  photographs.  The  present  report  is  an  attempt  to  document  this  progress 
of  the  last  year. 

This  final  report  is  organized  as  follows:  In  Section  7.2  we  provide  a  detailed  de¬ 
scription  of  the  current  states  of  our  hierarchial,  region-based  approach  to  automated 
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photointerpretation.  This  is  followed,  in  Section  7.2  by  a  detailed  development  of  an  unsu¬ 
pervised  model-fitting  approach  to  cluster  validation  with  particular  application  to  image 
segmentation.  This  has  been  used  in  our  image  interpretation  approach.  In  Section  7.3 
we  describe  an  implementation  of  this  overall  image  interpretation  approach  on  the  TI 
Explorer. 

References  for  Section  7.1 

1.  J.W.  Modest ino,  “A  Hierarchial  Region-Based  Approach  to  Automated  Photointer¬ 
pretation,”  NAIC  Final  Report  for  FY’86. 

2.  H.L.  Van  Trees,  Detection,  Estimation  and  Modulation  Theory  I  Wiley  and  Sons,  New 
York,  1968. 

3.  R.  Kinderman  and  J.L.  Snell,  Markov  Random  Fields  and  Their  Applications,  Amer¬ 
ican  Mathematical  Society,  Providence,  RI,  1980. 
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In  this  section,  a  Markov  random  field  (MRF)  model-based  approach  to  automated 
image  interpretation  is  described  and  demonstrated.  This  scheme  is  a  region-based  ap¬ 
proach  in  which  an  image  is  first  segmented  into  a  collection  of  disjoint  regions  which 
form  the  nodes  of  an  adjacency  graph.  Once  the  adjacency  graph  has  been  determined, 
image  interpretation  is  achieved  through  assigning  object  labels,  or  interpretations,  to  the 
segmented  regions,  or  nodes,  using  domain  knowledge,  extracted  feature  measurements 
and  spatial  relationships  between  the  various  regions.  In  this  approach,  the  interpreta¬ 
tion  labels  are  modeled  as  a  MRF  on  the  corresponding  adjacency  graph  and  the  image 
interpretation  problem  is  then  formulated  as  a  maximum  a  posteriori  (MAP)  estimation 
rule  given  domain  knowledge  and  region-based  measurements.  Simulated  annealing  is  used 
to  find  this  best  realization,  or  optimal  MAP  interpretation.  Through  the  MRF  model, 
and  its  associated  Gibbs  distribution,  this  approach  also  provides  a  systematic  method  for 
organizing  and  representing  domain  knowledge  through  appropriate  design  of  the  clique 
functions  describing  the  Gibbs  distribution  representing  the  pdf  of  the  underlying  MRF. 
We  provide  a  general  methodology  for  design  of  the  clique  functions.  Results  of  image  in¬ 
terpretation  experiments  performed  on  synthetic  and  real-world  images  using  this  approach 
are  described  and  appear  promising. 

7.2.2  Background: 

Image  interpretation  is  the  process  of  understanding  the  meaning  of  an  image  through 
identifying  significant  objects  in  the  image  and  analyzing  their  spatial  relationships.  These 
objects  can  be  very  simple,  such  as  tools  and  work  parts  in  an  assembly  line  scene;  or  they 
can  be  quite  complicated  and  composed  of  many  simpler  objects,  such  as  a  runway  area 
full  of  airplanes  in  an  airport  scene. 

The  need  for  image  interpretation  can  be  found  in  many  diverse  fields  of  science 
and  engineering.  For  example,  a  major  application  of  image  interpretation  is  in  remote¬ 
sensing,  or  aerial/satellite  photointerpretation,  which  is  widely  used  in  geological  survey 
and  military  air  reconnaissance  [l]-  [3].  Image  interpretation  also  plays  an  important  part 
in  biomedical  science  and  particle  physics  where  much  of  the  experimental  results  are 


7.2.1 


recorded  in  the  form  of  photographs  [4]-[5]. 

Traditionally,  the  task  of  image  interpretation  is  performed  by  well-trained  and  expe¬ 
rienced  human  experts.  However,  analyzing  a  complex  image  is  quite  labor  intensive.  Fur¬ 
thermore,  as  a  result  of  the  rapid  advances  in  imaging/photographic  technology,  in  many 
of  the  previously  mentioned  applications  the  large  amount  of  images  generated  would  soon 
overload  the  relatively  small  number  of  experts.  Hence,  much  of  the  research  in  image 
processing  has  been  directed  towards  constructing  automated  (computerized)  image  inter¬ 
pretation  systems.  Recent  research  in  intelligent  robots  has  created  yet  another  need  for 
automated  image  interpretation.  In  this  case,  the  robots  need  to  understand  what  they 
“see”  with  imaging  sensors  in  order  to  be  able  to  perform  intelligent  tasks  in  complex  envi¬ 
ronments  [6]-[7].  Here,  the  robots  have  to  rely  entirely  on  automated  image  interpretation. 

Most  of  the  existing  image  interpretation  techniques  involve  two  major  operations, 
low-level  and  high-level  processing.  In  low-level  processing,  the  representation  of  an  image 
is  transformed,  through  image  processing  operations,  such  as  edge  detection  and  region 
segmentation,  from  a  numerical  representation,  as  an  array  of  pixel  intensities,  to  a  symbolic 
representation,  as  a  set  of  spatially  related  image  primitives ,  such  as  edges  and  regions. 
Various  features  are  then  extracted  from  the  primitives.  These  features  may  include:  the 
lengths  of  significant  edges,  average  intensities  of  regions,  shape  and/or  texture  descriptors, 
etc.  Also  extracted  would  be  the  spatial  relationship  between  the  image  primitives.  In  high- 
level  processing,  image  domain  knowledge  is  used  to  assign  object  labels,  or  interpretations, 
to  the  primitives  and  construct  a  description  as  to  “what  is  present  in  the  image” .  In  the 
rest  of  this  paper,  we  often  refer  to  the  object  labels  as  interpretations  and  the  overall 
interpretations  for  all  the  primitives  in  the  image  as  the  interpretation  of  the  image. 

The  main  approach  in  early  research  on  image  interpretation  was  that  of  classification 
[4]-[5],  [8]  in  which  isolated  image  primitives  are  classified  into  a  finite  number  of  object 
classes  according  to  their  feature  measurements.  However,  since  low-level  processing  often 
produces  erroneous  or  incomplete  primitives  and  noise  in  the  image  may  often  cause  mea¬ 
surement  errors  in  the  features,  the  performance  of  image  interpretation  systems  using  the 
classification  approach  is  quite  limited.  Tne  main  problem  here  is  that  the  rich  knowledge 
of  the  spatial  constraints  between  objects,  used  by  human  experts,  has  not  been  used  in 
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the  high-level  processing. 

To  solve  this  problem,  most  of  the  recent  techniques  have  adopted  the  knowledge - 
based ,  or  expert  system ,  approach.  In  this  approach,  domain  knowledge,  and  especially 
spatial  constraints,  are  used  in  high-level  (some  also  in  low-level)  processing.  Hence,  an 
ambiguous  object  may  be  recognized  as  the  result  of  successful  recognition  of  its  neigh¬ 
boring  objects.  Even  more  fundamentally,  an  object  can  be  recognized  from  combining 
the  feature  information  from  several  spatially-related  image  primitives.  Finally,  low-  level 
processing  errors  may  be  corrected,  or  at  least  mitigated,  through  feedback  from  high-level 
processing  to  low-level  image  processing. 

The  early  work  in  knowledge-based  image  interpretation  has  been  summarized  in  Na- 
gao  and  Matsuyama  [11],  Binford  [12]  and  Ohta  [13].  Recently,  a  number  of  more  sophisti¬ 
cated  experimental  systems  have  been  constructed  for  different  application  domains,  such 
as  high-altitude  aerial  photographs  [ll],  [14]-[i5],  [16]-[18];  airport  scenes  [19],  [20]-[2l] 
and  outdoor  scenes  [l3],[22]-[24].  Many  of  these  systems  are  still  undergoing  continu¬ 
ous  improvements  through  architecture  modification  and  domain  extension.  New  ideas 
and  systems  are  constantly  emerging,  as  can  be  seen  in  recent  PRCV  and  SPIE  confer¬ 
ences  and  workshops,  and  several  pertinent  technical  reports  [25]-[26],[21].  While  success 
has  been  demonstrated  to  various  degrees  in  these  systems,  developing  a  general,  domain- 
independent  and  systematic  method  for  constructing  knowledge-based  image  interpretation 
systems  is  still  an  open  problem  [22]. 

In  this  paper,  we  describe  a  general,  domain-independent,  stochastic  model-  based 
approach  to  the  image  interpretation  problem.  In  this  approach,  the  interpretation  la¬ 
bels  to  be  assigned  to  the  primitives  of  an  image  are  modeled  as  a  Markov  random  field 
(MRF)  defined  on  the  spatial  adjacency  graph  formed  by  the  primitives,  where  the  ran¬ 
domness  is  used  to  model  the  uncertainty  in  the  assignment  of  the  labels.  As  a  result, 
the  domain  knowledge,  whatever  it  may  be,  can  be  systematically  represented  in  terms  of 
the  clique  functions  associated  with  the  underlying  Gibbs  probability  distribution  function 
(pdf)  describing  the  MRF.  Under  the  MRF  modeling  assumption,  image  interpretation  is 
then  formulated  as  the  optimization  problem  of  maximizing  the  a  posteriori  probability 
of  interpretation  given  domain  knowledge  and  feature  measurements.  Then,  simulated 
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annealing  is  used  to  find  the  optimal  set  of  interpretation  labels.  In  this  paper,  we  present 
a  special  region-based  version  of  this  approach.  That  is,  the  primitives  are  segmented  re¬ 
gions;  and  for  the  sake  of  simplicity,  we  do  not  include  feedback  from  high-level  to  low-level 
processing.  However,  research  is  currently  ongoing  to  include  use  of  linear  edge  segments 
as  primitives  as  well  as  high-to-low  level  feedback  [27]-[28].  This  will  also  be  discussed  in 
subsequent  sections. 

This  paper  is  organized  as  follows.  In  the  next  section,  we  describe  the  MRF  model- 
based  formulation  for  the  image  interpretation  problem.  Then,  in  Section  7.2.4,  we  will 
show  how  domain  knowledge  can  be  organized  into  clique  functions  associated  with  the 
MRF  model.  After  the  discussion  of  the  implementation  of  the  optimization  (interpre¬ 
tation)  through  the  simulated  annealing  procedure  in  Section  7.2.5,  we  will  present  and 
discuss  results  of  image  interpretation  experiments  performed  on  synthetic  and  real-world 
images  in  Sections  7.2.6  and  7.2.7.  Finally,  a  summary  and  directions  for  future  research 
are  provided  in  Section  7.2.8. 

7.2.3  The  MRF  Model-Based  Approach  to  Image  Interpretation: 

The  MRF  model,  as  an  extension  of  the  one-dimensional  Markov  process,  has  recently 
attracted  much  attention  in  the  image  processing  and  computer  vision  community.  The 
main  advantage  of  the  MRF  model  is  that  it  provides  a  general  and  natural  model  for  the 
interaction  between  spatially  related  random  variables  and  there  is  a  relatively  efficient 
optimization  algorithm,  simulated  annealing,  that  can  be  used  to  find  the  globally  optimal 
realization  which,  in  this  case,  corresponds  to  the  maximum  a  posteriori  (MAP)  interpre¬ 
tation.  Up  to  now,  the  success  of  MRF  models  has  been  demonstrated  mostly  in  low-level 
image  processing  applications,  such  as  region  segmentation  [29)-[36]  and  edge  detection 
[37],  where  they  are  defined  on  two-  dimensional  (2-D)  lattices  on  which  the  images  are 
represented  as  2-D  arrays.  For  example,  in  stochastic  model-based  image  segmentation, 
the  pixels  are  classified  into  a  finite  number  of  statistical  classes  and  the  MRF  is  used  to 
model  the  spatial  distribution  of  pixel  classes,  or  region  distributions  [29]-[36].  However, 
as  demonstrated  by  Kinderman  and  Snell  [38],  the  MRF  can  be  defined,  in  general,  on 
graphs  for  which  the  2-D  lattice  is  a  special  case.  In  what  follows,  we  will  briefly  review 
the  concepts  associated  with  the  MRF  defined  on  graphs  and  show  how  this  can  be  applied 
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to  the  image  interpretation  problem.  More  comprehensive  treatments  on  MRF’s  can  be 
found  in  [38]- [39]. 

A.)  The  MRF  Model  on  Graphs: 

Let  G  =  {R,E}  be  a  graph,  where 

R  =  {-Ri»  Rat  •  •  • » R//}i  (l) 

is  the  set  nodes  represented  by  Ri,  i  =  1, 2, . . . ,  N]  E  is  the  set  of  edges  connecting  them. 
Suppose  that  there  exists  a  neighborhood  system  on  G,  denoted  by 

n  =  {n(Pi),n(P2),...,n(Ptf)},  (2) 

where  n(i2»),  s'  =  1,2,...,  JV,  is  the  set  of  all  the  nodes  in  R  that  are  neighbors  of  Ri,  such 
that 

i. )  Ri  n(iZj),  and 

ii. )  if  Rj  €  n(Ri)  then  R,  €  n(R3). 

Let 


I  =  {Ii,J2,...  ,In}  (3) 

be  a  family  of  random  variables  defined  on  R.  Then,  I  is  called  a  random  field ,  where  Ii 
is  the  random  variable  associated  with  Ri.  Notice  that  the  random  variables  /»•’ s  here  can 
be  numerical  as  well  as  symbolic,  e.g.,  interpretation  labels.  We  say  I  is  a  MRF  on  G  with 
respect  to  the  neighborhood  system  n  if  and  only  if 

i. )  P[I]  >  0,  for  all  realizations  of  I; 

ii. )  Pr[Ii\IitsX\  Rj  ?Ri]  =  P\Ii\Ij,Rj  €  n(Jfc)], 

where  P[«]  and  P[-|-]  are  the  joint  and  conditional  pdf’s,  respectively.  Intuitively,  the  MRF 
is  a  random  field  with  the  property  that  the  statistics  at  a  particular  node  depends  mainly 
on  that  of  its  neighbors. 

An  important  feature  of  the  MRF  model  defined  above  is  that  its  joint  pdf  has  a 
general  functional  form,  known  as  the  Gibbs  distribution,  which  is  defined  based  on  the 
concept  of  cliques  [38]-[39].  Here,  a  clique  associated  with  the  graph  G,  denoted  by  c, 
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is  a  subset  of  R  such  that  it  contains  either  a  single  node  or  several  nodes  that  are  all 
neighbors  of  each  other.  If  we  denote  the  collection  of  all  the  cliques  of  G  with  respect 
to  the  neighborhood  system  n  as  C(G,n),  the  general  functional  form  of  the  pdf1  of  the 
MRF  can  be  expressed  as  the  following  Gibbs  distribution: 

P[I]  =  Z-'exp[-U(I)l  (4a) 

where 


^(I)  =  £  K,(I),  (4b) 

e€C(G,a) 

is  called  the  Gibb  ’s  energy  function  and  V0(I)’s  are  called  clique  functions  defined  on  the 
corresponding  cliques  e  €  C(G,n).  Finally, 


is  the  normalization  factor  to  make  (4a)  a  valid  pdf.  Notice  that  the  MRF  pdf  above 
is  quite  rich  in  that  the  clique  functions  can  be  arbitrary  as  long  as  they  depend  only 
on  the  nodes  in  the  corresponding  cliques.  Due  to  this  unique  structure,  in  which  the 
global  and  local  properties  are  related  through  cliques,  the  MRF  model-based  approach  to 
image  interpretation  provides  potential  advantages  in  knowledge  representation,  learning 
and  optimization,  as  will  be  discussed  in  more  detail  later.  More  importantly,  this  method 
provides  a  useful  mathematical  framework  for  the  study  of  image  interpretation  procedures. 
B.)  The  MRF  Model-baaed  Formulation : 

As  described  in  Section  7.2.2,  for  the  time  being  we  restrict  the  image  interpretation 
problem  to  that  of  labeling  segmented  regions.  Suppose  for  a  given  image,  there  are  N 
disjoint  regions  after  segmentation,3  denoted  by  R  =  {Ri,Ra,...,R//}.  Then  R  can  be 
represented  by  a  set  of  nodes  in  a  connected  graph,  called  the  adjacency  graph ,  denoted 

1  Actually,  this  is  a  probability  mass  function  (pmf)  due  to  the  discrete  nature  of  I  although 
we  will  not  make  this  distinction  in  what  follows  and  continue  to  use  the  term  pdf. 
3Clearly,  the  number  N  of  segmented  regions  is  a  random  variable  depending  upon  the 
image  as  well  as  the  segmentation  procedure. 
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by  G  =  {R,E};  where  the  edge  set  E  is  such  that  a  node  Ri  is  connected  to  another 
node  Rj  if  and  only  if  the  corresponding  regions  are  spatially  adjacent.  A  neighborhood 
system,  denoted  by  n,  can  also  be  defined  on  the  adjacency  graph.  For  simplicity,  in 
what  follows  we  define  the  neighbors  of  a  node  to  be  the  nodes  that  are  connected  to  it 
directly  by  an  edge  of  G,  i.e.,  only  spatial  adjacent  regions  are  neighbors.  Now,  given 
the  neighborhood  system,  we  can  also  find  the  cliques  for  the  adjacency  graph.  As  an 
illustration,  we  have  shown  in  Fig.  7.2.1  the  adjacency  graph  and  all  its  cliques  for  a 
particular  synthetic  conceptual  image.  This  image  is  intended  to  represent  a  car  on  a  road 
between  two  fields  with  the  sky  as  a  background.  In  forming  the  adjacency  graph,  we 
assume  perfect  segmentation  of  the  image  objects. 

As  described  in  Section  7.2.2,  image  interpretation  is  the  process  of  assigning  object 
labels  to  the  segmented  regions  according  to  domain  knowledge  and  feature  measure¬ 
ment  information  (or  measurements ,  in  short)  made  on  these  regions.  From  the  above 
graphical  formulation,  the  interpretation  of  the  image  cam  be  represented  as  a  vector 
I(R)  =  {Ji,  J2, . . . ,  In},  defined  on  the  adjacency  graph  G,  where  we  use  I(R)  to  empha¬ 
size  the  relationship  between  interpretation  and  the  symbolic  representation  in  terms  of 
segmented  regions.  Here,  *  =  1,2,...,JV,  is  the  interpretation  label  for  node  P,-;  while 
Ii  €  L  and  L  =  {Lx,  L2, . . . ,  Lm}  is  the  set  of  all  the  interpretation  labels.  In  addition,  we 
consider  /,•’ s  as  symbolic  random  variables  to  account  for  the  uncertainty  in  assigning  ob¬ 
ject  labels  to  segmented  regions  due  to,  e.g.,  image  noise  and  segmentation  errors.  Hence, 
I(R)  is  a  random  field.  Let’s  denote  the  domain  knowledge  as  K  and  all  the  measurements 
made  on  the  segmented  regions  as  X(R).  Now,  we  can  define  image  interpretation  as  the 
following  optimization  problem :  for  a  given  R,  find  Io(R),  such  that 

Io(R)  =  arg  max  P[I(R)  |K,X(R)],  (5) 

where  P[*|*,  •]  is  the  a  posteriori  pdf  of  the  interpretation  given  the  domain  knowledge  and 
measurements,  while  {L}N  is  the  set  of  all  possible  interpretation  vectors  of  length  N. 
The  formulation  of  (5)  is  also  known  as  the  maximum  a  posteriori  (MAP)  formulation. 

Two  problems  must  be  solved  in  applying  the  above  MAP  approach  to  image  inter¬ 
pretation.  Specifically,  we  need  an  explicit  expression  for  the  conditional  pdf  in  (5)  and 
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an  optimization  method  to  avoid  the  computationally  explosive  nature  of  exhaustive  com¬ 
binatorial  search.  Feldman  and  Yakimovski  [40],  and  Faugeras  and  Price  [14]-[15]  have 
considered  similar  formulations  to  that  of  (5)  and  proposed  heuristic  expressions  for  the  a 
posteriori  pdf  using  the  marginal  pdf’s  of  single  and  joint  pdf’s  of  pairs  of  interpretation 
labels.  They  have  also  used  different  relaxation  schemes  to  find  local  optimal  solutions, 
some  of  which  have  also  been  studied  in  [54]-[57|.  On  the  other  hand,  the  MRF  model 
discussed  in  A.)  appears  to  provide  a  natural  solution  to  the  above  two  problems.  More 
specifically,  assume  that  I(R)  forms  a  MRF.  Then,  the  pdf  appearing  in  (5)  is  the  Gibbs 
distribution 


P|I(R)  |  K,X(R)]  =  Z-‘tzp[-V( I(R) ;  K,X(R))|,  (6a) 

with  energy  function 

Cf(I(R) ;  K,X(R))  =  Y,  K,(I(R) ;  K,X(R)),  (6b) 

c€C(  G,n) 

where  the  V0(*;  •,  -)’s  are  the  clique  functions.  Indeed,  as  will  be  seen  in  the  subsequent 
sections,  through  imposing  a  neighborhood  system  and  the  Markov  property  of  ii.),  the 
MRF  model-based  formulation  provides  a  general  and  systematic  approach  for  knowledge 
representation  and  knowledge  acquisition  through  appropriate  construction  of  the  clique 
functions.  For  the  optimization  strategy,  the  simulated  annealing  procedure  can  be  used 
to  find  the  globally  optimal  interpretation  for  the  image.  In  addition,  the  approach  of 
[40], [14], [15]  can  be  shown  to  be  special  cases  with  certain  neighborhood  structures  and 
clique  functions.  Finally,  when  used  in  the  context  of  image  interpretation,  the  MRF  model 
suggest  that  the  interpretation  for  a  particular  region  given  those  of  all  other  regions,  de¬ 
pends  only  on  the  interpretations  of  its  neighboring  regions.  This  is  often  a  reasonable 
assumption  in  practical  applications.  For  example,  the  identification  of  a  region  as  a  car 
might  depend  on  whether  its  neighboring  regions  are  a  road  but  has  little  to  do  with 
the  identity  of  the  regions  spatially  far  removed  from  it.  In  the  rest  of  the  paper,  we 
will  model  the  interpretation  vector  I(R)  as  a  MRF.  Since  simulated  annealing  is  a  rela¬ 
tively  well-defined  procedure,  we  will  concentrate  on  the  knowledge  engineering  aspects  of 
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the  image  interpretation  problem;  that  is,  knowledge  representation  and  learning  through 
constructing  the  pdf  of  the  MRF. 

Finally,  we  should  point  out  that,  although  the  MRF  model-based  approach  is  pre¬ 
sented  here  in  the  form  of  a  region-based  approach,  it  can  be  extended  to  include  other 
primitives  and  to  model  the  situation  where  interaction  between  high-level  and  low-level 
processing  (e.g.,  feedback)  is  used.  When  other  primitives,  such  as  linear  edge  segments, 
are  introduced  they  can  be  considered  as  nodes  of  a  generalized  adjacency  graph;  they 
can  also  be  considered  as  features  associated  with  different  regions  rather  than  primitives 
themselves.  To  model  the  high-level  and  low-level  interaction  during  interpretation,  the 
adjacency  graph  can  be  considered  as  a  dynamic  graph  which  changes  with  time;  subse¬ 
quently,  the  MRF  become  also  a  dynamic  model.  Currently,  these  problems  are  under 
active  investigation. 

7.2.4  The  Design  of  Clique  Functions: 

In  the  MRF  model-based  formulation  of  the  preceding  section,  it  is  clear  that  the 
optimal  interpretation,  Io  (R),  should  be  the  one  that  minimizes  the  energy  function,  or 
has  the  minimum  energy.  For  a  given  image,  the  optimal  interpretation  depends  on  how  the 
energy  function  is  defined.  In  general,  we  would  like  the  optimal  interpretation  obtained 
under  the  MRF  assumption  to  be  the  one  that  is  most  consistent  with  the  measurements 
and  domainr  knowledge.  For  example,  in  aerial  photointerpretation,  suppose  we  know 
that  a  car  has  small  area  and  would  usually  be  on  a  road.  An  interpretation  with  a  car 
having  large  area  or  in  the  sky  should  obviously  be  considered  not  optimal.  This  type 
of  consistency  requirement  can  be  achieved  by  properly  selecting  the  energy  functional 
or,  rather,  the  corresponding  clique  functions.  It  will  be  seen  in  the  following  that,  by 
using  the  MRF  model,  the  domain  knowledge  can  be  organized  easily  and  systematically 
as  clique  functions  to  provide  a  proper  energy  functional  such  that  the  consistency  between 
the  interpretation,  the  measurements  and  domain  knowledge  is  maintained. 

Without  loss  of  generality,  we  assume  that  all  the  clique  functions  sire  non-  negative. 
Then,  a  general  principle  for  the  selection  of  a  clique  function  is  the  following. 

Design  Rule: 

If  the  interpretation  of  the  regions  (or  region  for  a  singleton  clique)  in  a  clique  tends 
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to  be  consistent  with  the  measurements  and  domain  knowledge,  the  clique  function 
decreases,  resulting  in  a  decrease  in  the  energy  function;  otherwise,  the  clique  function 
increases,  resulting  in  a  corresponding  increase  in  the  energy  function. 

In  this  way,  an  interpretation  for  the  image  that  is  most  consistent  with  the  measurements 
and  domain  knowledge  will  have  the  minimum  energy,  or  achieve  the  optimum.  Based  on 
this  principle,  we  now  propose  a  general  approach  to  defining  clique  functions  from  domain 
knowledge.  We  first  consider  the  clique  functions  for  single-node  cliques,  and  then  extend 
the  result  to  the  case  of  multiple-cliques. 

A.)  Clique  Functions  for  Single-Node  Cliques: 

Let  e  be  an  arbitrary  single-node  clique  with  one  node,  R.  Let  the  corresponding  clique 
function  be  denoted  by  Va(l(R) ;  K,X(i?)),  it  depends  only  upon  the  single  node  R,  its 
interpretation  I  =  I(R),  and  the  measurements  X(J?)  on  the  corresponding  segmented 
region  R ,  as  well  as  the  domain  knowledge  represented  by  K.  Suppose  that  X(i2)  has 
m  components,  Xt(R),  X^(R), . . . ,  Xm(i2),  representing  measurement  values  of  m  well- 
defined  features  of  R,  e.g.,  average  gray  level,  area,  standard  deviation  of  gray-levels,  etc. 
Assuming  the  components  of  X(i2)  are  independent,  we  can  define  a  clique  function  for 
clique  c  as 


VC{I{R) ;  K,X(R))  -  £»£>(J(*),K)B<*>(/(S);K,Xi(*)),  (7) 

»=1 

where  •,  •),  i  =  1,2,  ...,m,  are  called  basis  functions  for  the  corresponding  clique 

function.  These  quantities  are  functions  of  the  »’th  feature  measurement,  Xi(R),  parame¬ 
terized  by  the  interpretation  I(R)  and,  of  course,  depend  upon  the  domain  knowledge,  K. 
The  pi^  (I,K)’s,  are  a  set  of  non  negative  numbers 

pW  (/,  K)  >  0  ;  i  =  1, 2, . . . ,  m,  (8a) 

which  can  be  conveniently  normalized  so  that 

£>W(/,K)  =  1,  (8b) 

»= 1 
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and  are  weights  associated  with  the  basis  functions.  Here,  p£^(/,K)  not  only  depends  on 
i  but  also  on  the  interpretation  7(72)  as  well  as  K. 

Now  the  problem  of  designing  clique  functions  becomes  that  of  designing  the  basis 
functions  of  the  features  and  determining  their  weights.  We  first  consider  the  design  of 
the  basis  functions.  Without  loss  of  generality,  we  assume  that  all  the  basis  functions 
are  non-negative.  Then,  the  consistency  principle  for  designing  clique  functions  (in  the 
previous  Design  Rule)  applies  to  the  design  of  the  basis  functions.  Here,  it  is  sufficient 
to  consider  the  design  of  a  particular  basis  function  for  a  single-node  clique  e,  denoted 
by  Bc(7(72);K,X(i?)),  where,  for  notational  simplicity,  the  index  *  has  been  dropped. 
According  to  the  consistency  requirement  between  interpretation,  measurements  and  do¬ 
main  knowledge,  we  want  the  basis  function  to  be  small  when  I(R),  X(R)  are  consistent 
according  to  K;  otherwise,  it  should  be  large.  One  way  to  achieve  this  is  to  take  a  prob¬ 
abilistic  approach.  In  particular,  we  consider  the  a  posteriori  pdf  Pc[7(72)  |  K,X(72)]. 
This  is  the  probability  that,  based  on  the  domain  knowledge,  K,  and  the  measurement, 
X(R ),  the  interpretation  of  the  node  R  should  be  7(12).  By  definition,  the  probability 
PC[I(R)  |  K,  AT(P)],  is  such  that  for  I(R)  consistent  with  the  measurements  and  domain 
knowledge,  it  is  large;  otherwise  it  is  small.  Hence,  a  non-increasing  function  of  this  pdf 
can  be  used  as  a  basis  function.  For  example,  the  logarithm  of  the  pdf  has  been  suggested 
[38]  as  a  reasonable  basis  function  for  general  MRF’s.  In  this  case  we  can  define 

Bc(l(R)-,K, X(R))  =  -aJogP'lUR)  |  K, *(£)],  (9) 

where  Qtc  is  a  positive  weighting  constant  and  —log(-)  is  a  monotonically  decreasing  func¬ 
tion.  Another  way  of  selecting  the  basis  function  is  to  use 

Be(l(R);K, X(R)J)  =  a0(l  -  0oPe\I(R)  |  K,X(J2)]),  (10) 

where  a0  and  0O  are  positive  constants,  and  (3aPc[I{R)  |  K,X(/2)]  <  1.  Usually,  we  want 
the  normalization  constants  a0  and  /?„  to  be  such  that  0  <  B0(l(R);K,X(R ))  <  1. 

To  find  the  pdf  PC[I(R)  |  K,X(72)],  Bayes’s  conditional  pdf  formula  can  be  used. 
That  is, 
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P,{I(R)  |  K,X(J5)]  =  P0|X(JJ)  |  K,  J(fl)]P0[K,/(fl)]p-l[K,X(fl)],  (11) 

where  the  first  term  is  the  likelihood  functional  of  the  measurement  conditioned  on  the 
interpretation,  which  can  be  found  easily  under  proper  modeling  assumptions,  and  the 
second  term  is  the  prior  pdf  of  the  interpretations,  which  can  be  determined  from  a  priori 
information,  or  heuristically.  Finally,  the  last  term  is  the  inverse  of  the  pdf  of  X(R)  which 
does  not  depend  on  /  and  hence  can  be  dropped  in  the  basis  function.  For  the  sake  of 
simplicity,  we  assume  the  prior  probability  to  be  a  constant;  that  is,  the  interpretations 
are  equally  likely  a  priori,  then  the  second  term  can  also  be  dropped. 

To  further  illustrate  what  we  mean  by  Pe[/(jR)  |  K,X(JR)]  and  how  a  basis  function 
can  be  defined  from  it,  consider  an  example  of  a  single  node  clique,  c.  Suppose  we  have 
the  following  knowledge: 

1.  The  node  could  be  sky,  field,  car,  road,  denoted  by  interpretation  labels  Lt,L/,Lc 
and  Lr. 

2.  The  average  gray-level,  a  feature  of  the  objects  above,  should  be  close  to  G,,  Gf,  Gc 
and  Gr  respectively,  and  G,  >  Gr  >  Ge  >  Gj. 

3.  The  distribution  of  the  measured  average  gray-level  of  the  node,  conditioned  on  each 
label  (sky,  field,  car,  road),  is  Gaussian.  That  is,  if  the  measured  average  gray-level 
X{R)  =  G,  then 


P[X(R)  |  K ,I(R)  =  Ls )  = 


(g  - 
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where  6  =  s,  /,  c,  r.  Then,  a  possible  basis  function  could  be  from  (9) 


(12) 


B.(I(R) ;  K,  X(R))  =  a.  (if<*2™?  +  -(G  .  (13) 

Plots  of  several  basis  functions,  including  the  one  proposed  by  Modestino  [41],  are 
shown  in  Fig.  7.2.2.  Notice  that  these  functions  are  all  ‘‘window-  like”  functions.  When 
domain  knowledge  about  a  feature  can  be  expressed  in  terms  of  a  nominal  value,  as  is  in 
the  example  above,  the  specification  of  the  corresponding  basis  function  can  be  greatly 
simplified  to  that  of  merely  constructing  one  of  these  window  functions.  The  piecewise 
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linear  basis  function  is  particularly  interesting  in  that  it  is  very  easy  to  compute  and,  as 
will  be  seen  in  later  sections,  it  is  relatively  robust  against  measurement  errors  or  image 
noise.  Hence,  we  give  it  a  special  notation,  g(x\  a  1,02, 61}  62);  where  x  is  the  variable  and 
are  the  four  “comer  points”,  with  fli  <  02  <  b\  <  62.  Similar  functions  have 
been  used  in  [22]  for  a  rule-based  image  interpretation  system  and  in  the  applications  of 
fuzzy  set  theory  [42]. 

B.)  Clique  Functions  for  Multiple-Node  Cliques: 

The  extension  of  the  clique  function  design  procedure  from  the  case  of  single-node 
cliques  of  A.)  to  the  case  of  multiple-node  cliques  is  quite  straightforward.  Here,  we  still 
design  clique  functions  through  designing  a  set  of  basis  functions,  as  indicated  in  expression 
(7).  However,  the  designing  of  the  basis  function  is  slightly  more  complicated  here  in  that 
we  may  have  two  types  of  basis  functions.  The  first  type  is  the  basis  function  for  feature 
measurements,  as  in  the  case  for  single-node  cliques.  The  features  in  this  case  could  be 
quantities  such  as  mutual  boundary  length,  contrast ,  etc.  Basis  functions  for  these  feature 
measurements  can  be  designed  in  the  same  way  as  that  in  A.)  using  the  window  functions 
of  Fig.  7.2.2.  The  second  type  of  basis  functions  are  those  for  spatial  constraints.  The 
constraints  in  this  case  could  be  statements  such  as  “a  car  should  be  on  (neighboring 
to)  the  road”,  “a  car  should  never  be  in  the  sky”,  etc.  In  this  case,  we  can  still  use  the 
probabilistic  approach  in  the  spirit  of  (9)-(l0).  For  example,  consider  an  arbitrary  clique  c 
with  multiple  nodes  denoted  by  R0  and  interpretations  Ic(Rc).  Let  P0[I0(RC)  |  K]  be  the 
probability  that  the  combination  of  interpretations  7e(Re)  is  valid  according  to  domain 
knowledge.  For  example,  we  might  have 


P0[Io(Re)  |  K]  =  1;  if  I0  is  a  valid  combination  according  to  domain  knowledge, 

=  0;  if  I0  is  not  a  valid  combination  according  to  domain  knowledge. 

Similar  to  (9)-(10),  we  can  define  the  basis  function  as 

RC(IC(R);K)  =  ac(l  -  Pc[Ic(R)  I  K]).  (14) 

C.)  The  Selection  of  the  Weights  for  the  Basis  Functions: 
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The  weights  of  the  basis  functions  in  (7)-(8)  control  the  contributions  of  the  individual 
basis  functions  to  the  value  of  a  clique  function.  For  simplicity,  we  may  make  them  all 
equal.  In  our  current  experiments  we  start  with  this  simple  scheme  and  then,  if  a  feature 
is  too  unreliable  for  a  particular  object  type,  we  will  reduce  the  corresponding  weight.  In 
addition,  adjustments  are  also  made  by  trial-and-error  through  examining  interpretation 
results  on  representative  training  images.  A  more  sophisticated  approach,  currently  under 
investigation,  is  to  select  a  weight  based  on  how  powerful  the  corresponding  feature  is  for 
object  recognition  and  discrimination.  For  example,  consider  an  arbitrary  weight,  denoted 
p£*^(J,  K),  for  a  single-node  clique.  It  depends  on  the  t’th  feature  and  object  label  I.  If 
the  t’th  feature  is  good  for  discriminating  different  objects,  or  it  is  a  good  feature  for 
recognizing  object  /,  the  weight  (/,  K)  should  be  relatively  large;  otherwise,  relatively 
small.  A  useful  indication  of  whether  a  given  feature  is  good  for  object  discrimination 
can  be  obtained  from  the  inter-duster  distances  [43],  where  the  clusters  are  formed  by 
the  measurements  of  the  feature  from  different  objects.  Similarly,  a  useful  indication  of 
whether  a  feature  is  good  for  recognizing  a  particular  type  of  object,  say  type  I,  can  be 
obtained  from  the  intra-cluster  standard  deviation  [43],  where  the  cluster  is  formed  by  the 
measurements  of  the  t’th  feature  on  many  objects  of  type  I. 

D.)  Remarks: 

To  conclude  this  section,  we  note  several  interesting  points.  First,  through  the  design 
of  clique  functions,  we  have  a  systematic  approach  for  representing  spatial  knowledge; 
that  is,  for  organizing  the  domain  knowledge  into  a  set  of  well-defined  clique  functions. 
This  approach  also  provides  guidelines  as  to  what  kind  of  knowledge  one  would  want 
for  the  purpose  of  image  interpretation;  basically,  knowledge  concerning  objects  spatially 
related  as  members  of  different  type  of  cliques.  It  seems  that  many  of  the  “rules”  in 
the  previous  expert  systems  mentioned  in  Section  7.2.2  can  be  transformed  into  clique 
functions,  where  the  condition  parts  correspond  to  evaluating  clique  functions  and  the 
action  parts  correspond  to  assigning  labels. 

Secondly,  under  the  current  neighborhood  system  assumptions,  there  are  at  most  four 
different  clique  types,  as  shown  in  Fig.  7.2.3,  which  contain  at  most  four  nodes.  This  is 
due  to  the  fact  that  the  adjacency  graph  associated  with  the  segmented  regions  is  a  planar 
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graph,  a  graph  without  overpassing  edges;  while  a  clique  containing  five  or  more  nodes 
causes  overpassing  of  edges  in  the  graph.  Hence,  the  design  of  clique  functions  is  relatively 
simple  due  to  the  small  number  of  different  clique  types. 

Finally,  as  has  been  pointed  out  in  Section  7.2.3,  the  Gibbs  distribution  is  a  very 
rich  distribution  in  that,  as  long  as  the  clique  functions  depend  only  on  the  corresponding 
cliques,  their  form  can  be  somewhat  arbitrary.  The  general  guidelines  provided  in  this 
section  on  designing  clique  functions  are  based  on  the  considerations  of  the  consistency 
requirement  in  the  image  interpretation  problem.  While  they  provide  useful  insights  into 
the  image  interpretation  problem  and  offer  practical  solutions,  they  are  not  necessarily  the 
only  or  the  best  choice. 

7.2.5  Tmnlemftntation  Through  Simulated  Annealing: 

In  the  last  section,  image  interpretation  is  formulated  as  a  MAP  estimation  problem. 
Under  the  MRF  modeling  assumption,  this  becomes  the  problem  of  minimizing  a  prop¬ 
erly  defined  energy  function  such  that  the  interpretation  obtained  is  most  consistent  with 
measurements  and  domain  knowledge.  The  simplest  optimization  method  is  an  exhaustive 
search  procedure.  This,  however,  results  in  an  exponential  complexity  of  0(MN),  where 
AT  is  the  number  of  labels  and  N  is  the  number  of  nodes  in  the  adjacency  graph.  An 
alternative  is  the  simulated  annealing  algorithm,  a  stochastic  iterative  optimization  pro¬ 
cedure,  that  will  find  the  global  maximum  of  the  pdf  of  the  MRF,  or  the  minimum  of  the 
energy  function,  without  excessive  computation  [29], [48].  The  simulated  annealing  algo¬ 
rithm  has  been  widely  used  in  various  applications  involving  combinatorial  optimization, 
such  as  VLSI  layout  [44],  channel  coding  [45],  image  segmentation  [46]-[47]. 

For  convenience,  we  rewrite  this  algorithm  here  in  the  context  of  a  minimization 
problem.  Let  the  function  to  be  minimized  be  22 (x),  where  x  is  the  indepdendent  variable. 
This  algorithm  can  be  loosely  described  as  follows: 

Simulated  Annealing 

1)  Select  an  initial  “temperature”  parameter  To  and  randomly  choose  an  initial  variable 
Xo.  Iteration  begins. 

2)  At  step  k,  perturb  x*  by  &*+i  =  x*  +  Ax  and  compute  A E  =  £(x*+i)  -  £(x*). 

3)  If  A E  <  0  accept  the  change;  that  is 
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x*+i  =  x*  +  Ax. 


If  A E  >  0,  accept  the  change  only  with  probability  p  =  e“Ar/T. 

4)  If  there  is  a  considerable  drop  in  energy,  or  enough  iterations,  lower  the  temperature. 

5)  If  the  energy  becomes  stable  and  the  temperature  is  very  low,  stop;  otherwise  go  back 
to  (2). 

For  image  interpretation,  the  implementation  is  straightforward  when  the  definition  of  the 
method  of  perturbation  and  the  annealing  schedule  (i.e.,  how  the  temperature  is  lowered) 
are  decided.  We  first  order  all  the  nodes  arbitrarily  as  node  1,2,  ...,JV.  Then,  an  iteration 
is  defined  as  one  visit  to  all  the  nodes  according  to  this  order.  When  a  node  is  visited,  a 
perturbation  of  the  interpretation  vector  is  performed  through  generating  a  new  label  for 
this  node  from  an  uniform  distribution  of  all  the  possible  interpretations,  Li,  £2>  . . . ,  Lm, 
where  M  is  the  number  of  different  labels,  or  from  the  conditional  probability  distribution 
of  the  MRF  (i.e.,  the  Gibbs  sampler  of  [29]).  In  our  experiments,  we  found  that  the 
two  perturbation  methods  provide  the  same  results  in  terms  of  converging  to  the  optimal 
interpretations,  while  the  Gibbs  sampler  is  more  complicated  in  computation  structure,  so 
we  have  mainly  made  use  of  the  first  approach  for  perturbation.  Finally,  for  the  annealing 
schedule,  the  temperature  is  lowered  after  each  iteration  according  to  Tj+i  =  aTi  where 
0.5  <  a.  <  1,  which  we  have  used  successfully  in  an  MRF  model-based  MAP  image 
segmentation  procedure  [46].  In  particular,  we  have  selected  To  =  1  and  a  =  0.92  for 
all  our  experiments. 

7.2.6  Interpretation  of  Synthetic  Images: 

After  the  discussion  of  the  previous  sections,  it  may  be  expected  that  the  efficacy 
of  the  MRF  model-based  approach  depends  on  several  factors,  including  the  validity  of 
the  MRF  assumption,  the  quality  of  the  segmentation,  how  powerful  the  features  and 
spatial  constraints  in  the  knowledge  base  are  (as  far  as  object  recognition  is  concern),  and 
finally,  how  effective  the  simulated  annealing  is  (convergence  of  interpretations).  Before 
applying  the  MRF  approach  to  real-world  image  interpretation,  then,  it  make  sense  to  test 
whether  this  approach  would  work  at  all  under  the  somewhat  ideal  situation  in  which  we 
have  acceptable  segmentations  and  relatively  strong  features  and  spatial  constraints  in  the 
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domain  knowledge.  If  the  MRF  model-based  approach  works  well  here,  then  it  is  reasonable 
to  expect  that  it  will  be  effective  for  real-world  image  interpretation  provided  we  can 
produce  good  segmentations  (or  are  able  to  deal  with  poor  ones)  and  find  strong  features 
and  spatial  constraints.  We  can  create  such  an  ideal  situation  through  generating  synthetic 
images  and  studying  the  performance  of  the  MRF  model-based  approach  in  interpreting 
them.  In  this  section,  we  describe  some  of  the  experimental  results  on  synthetic  images 
taken  from  a  more  complete  study  [41], [46]. 

The  synthetic  images  used  in  this  experiment  are  variations  from  the  conceptual  image 
of  Fig.  7.2.1  which  contains  such  objects  as  sky,  road,  field,  and  car,  all  of  which  appear  as 
regions  of  constant  gray-levels.  The  assumed  domain  knowledge  associated  with  this  image 
is  shown  in  Table  7.2.1  where  object  features  (with  precise  definitions  in  Table  7.2.2)  and 
spatial  constraints  are  stated.  In  the  experiments  [41], [46],  we  have  found  that,  compared  to 
other  basis  functions  of  Fig.  7.2.2,  the  piecewise-linear  basis  functions  were  more  effective 
for  object  recognition  and  less  sensitive  to  segmentation  error.  Hence,  all  the  results  for 
image  interpretation  in  this  paper  have  been  obtained  using  clique  functions  composed  of 
this  type  of  basis  function.  In  Table  7.2.3,  these  clique  functions  are  shown  in  terms  of  their 
basis  functions  and  weights  for  the  synthetic  image  (corner  points  «i,  a2, 6ls  62,  weight  p, 
and  ae  for  the  basis  function  for  spatial  constraints).  Finally,  the  segmentation  algorithm 
used  here  is  a  Gaussian  model-based  segmentation  algorithm  [49]  which  has  been  quite 
effective  for  aerial  photograph  segmentation. 

The  experiments  on  the  synthetic  image  described  in  this  section  contain  three  parts. 
First,  the  “ideal  image”  of  Fig.  7.2.1  is  interpreted  with  result  shown  in  Fig  7.2.4.  In 
this  case,  the  extracted  segmentation  is  perfect,  as  shown  in  Fig.  7.2.4  b,  since  all  the 
regions  have  constant  gray  levels.  The  interpretation  result  in  Fig.  7.2.4  c,  in  which 
different  gray-  levels  indicate  different  object  labels  (as  shown  in  Fig.  7.2.4  d),  shows 
that  all  the  regions  are  correctly  identified.  This  suggests  that  when  the  image  is  well- 
segmented  and  the  domain  knowledge  is  sufficient,  the  MRF  model-based  approach  is 
effective.  As  reported  in  [46],  starting  from  a  random  initial  interpretation  vector,  the 
simulated  annealing  converged  within  25  iterations.  In  this,  and  the  rest  of  the  results  of 
this  section,  the  clique  functions  of  Table  7.2.3  have  been  used,  and  the  number  of  iteration 
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for  the  simulated  annealing  is  set  to  25. 

In  the  second  part  of  the  experiments,  the  ideal  image  of  Fig.  7.2.1  is  corrupted 
by  additive  white  Gaussian  noise  to  generate  degraded  images  of  different  signal-to-noise 
ratios  (SNR’s).  These  images,  then,  are  presented  for  interpretation.  Here,  the  added 
noise  should  result  in  errors  in  segmentation  and  feature  measurements.  This  is  used  to 
study  the  performance  of  the  MRF  approach  under  moderately  imperfect  segmentation. 
In  [46]  experiments  have  been  performed  for  images  with  SNR  of  20dB,  lOdB,  and  3dB 
with  similar  results.  Hence,  the  results  are  only  shown  in  Fig.  7.2.5  for  the  3dB  case. 
Again,  all  the  objects  are  correctly  identified.  Here,  the  car  has  very  a  small  area  and 
its  identification  might  be  most  seriously  affected  by  segmentation  error.  However,  since 
the  road  can  be  well  identified,  the  car  can  still  be  identified  partly  due  to  the  spatial 
constraints  between  them.  To  be  able  to  deal  with  more  serious  segmentation  errors,  such 
as  the  case  in  which  the  car  is  split  into  several  regions,  the  clique  functions  have  to  be 
expanded  to  allow  the  merging  of  regions.  This  is  currently  under  investigation  [27]. 

In  the  last  part  of  the  experiments,  we  have  consider  the  case  where  there  are  unknown 
objects  in  the  image  for  which  no  information  is  available  in  the  knowledge  base.  There  sure 
two  main  causes  for  unknown  regions,  e.g.,  the  region  belongs  to  an  unknown  object  or  the 
region  belongs  to  a  known  object  but  the  feature  measurements  are  far  from  the  nominal 
values  of  all  known  objects  due  to  segmentation  errors.  In  both  cases,  it  is  more  desirable  to 
assign  an  “unknown”  label  to  such  regions  than  to  risk  making  a  mistake.  Hence,  we  have 
added  an  unknown  label  to  the  set  of  object  labels.  In  this  part,  an  “unknown”  object 
which  is  similar  to  a  car  is  placed  in  the  right  “field”  region  of  the  images  used  in  the 
preceding  experiments  and  the  resulting  image  has  been  re-interpreted  [46].  Here,  we  have 
only  shown  the  results  for  the  ideal  image  and  the  image  with  3dB  SNR  in  Fig.’s  7.2.6  and 
7.2.7,  respectively.  All  the  regions  have  been  identified  correctly  and  the  unknown  region 
is  not  identified  as  a  car  since  that  will  violate  the  spatial  constraint  in  the  knowledge  base. 
This  example,  to  some  extent,  shows  the  flexibility  of  the  MRF  approach. 

7.2.7  Interpretation  of  Real-World  Images; 

To  test  the  practical  applicability  of  the  proposed  MRF  approach  for  image  inter¬ 
pretation,  experiments  have  been  performed  on  real-world  images;  in  particular,  aerial 
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photographs,  which  have  been  digitized  to  256  gray-level  images  of  256  x  256  pixels.  Ex¬ 
perimental  results  obtained  here  also  provide  further  insights  and  useful  guidelines  on 
how  to  effectively  apply  this  general  approach  in  practice.  Since  we  have  no  control  over 
the  generation  process  of  the  images  involved,  the  task  of  interpreting  real-world  images 
is  much  more  complicated  and  difficult  than  that  of  interpreting  synthetic  images.  We 
have  proceeded  in  two  steps;  namely,  knowledge  acquisition  through  constructing  clique 
functions  of  the  MRF  model  and  interpretation  using  simulated  annealing. 

A.)  Knowledge  Acquisition 

This  is  the  process  of  gathering  information  about  the  objects  of  interest  in  real- 
world  images.  This  information  is  usually  represented  in  terms  of  features  and  spatial 
constraints.  For  example,  we  might  like  to  obtain  information  about  cars,  such  as  “a  car 
has  an  average  area  of  800  pixels”  and  “a  car  is  always  on  (neighboring  to)  a  road” .  Here, 
the  area  is  a  feature  while  the  neighborhood  relationship  is  a  spatial  constraint.  Knowledge 
acquisition  is  also  a  selection  process  in  which  we  select  certain  features  and  constraints 
from  all  the  features  and  constraints  we  know  about  the  objects  to  form  a  knowledge 
base.  The  selection  process  is  necessary,  since  some  of  the  features  and  constraints  are 
not  essential  to  interpretation,  while  they  only  add  system  complexity  and  computational 
burden.  In  the  selection  of  the  features  and  constraints,  we  want  to  select  the  ones  that 
are  powerful  for  object  recognition  and  discrimination.  At  this  point,  this  selection  is 
performed  heuristically  through  trial-and-error.  As  a  future  goal,  a  general  approach  to 
solve  this  problem,  such  as  the  one  described  in  Section  7.2.4  C,  needs  to  be  found. 

There  are  many  sources  for  knowledge  acquisition.  For  aerial  photointerpretation, 
one  source  of  information/knowledge  is  from  map  information  and  characteristics  of  the 
man-made  objects  in  the  area  being  photographed  [20].  This  approach  is  used  often  in 
constructing  practical  photointerpretation  systems  and  it  usually  involves  a  large  amount 
of  information.  A  less  complicated  alternative  is  the  training  approach.  In  this  approach,  a 
number  of  representative  or  training  images  are  first  segmented  and  interpreted  by  human 
experts;  knowledge  is  then  extracted  from  these  segmented  and  interpreted  images.  The 
training  approach  is  relatively  simple,  while  not  sacrificing  generality  in  principle;  hence 
is  very  useful  in  experimental  studies  such  as  this.  In  this  work,  we  make  exclusive  use  of 
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the  training  approach. 

For  simplicity,  only  one  training  image  is  used  in  this  experiment  which  is  an  aerial 
photograph  as  shown  in  Fig.  7.2.8.  After  human  expert  segmentation  and  interpretation 
performed  using  an  interactive  display-segmentation  facility  at  RP1  [51],  it  has  been  found 
to  contain  mainly  the  following  types  of  objects:  l.jvegetation  region  (VEGE);  2.)shadow 
(SHD1);  3.)shadow  on  the  ground  (SHD2);  4.)ground  (GRND);  and  5.)oil  tank  (OLTK). 
The  feature  measurements  selected  for  the  objects  includes:  area,  average  gray-level,  com¬ 
pactness  of  an  object,  and  contrast  between  two  regions  (see  the  definitions  in  Table  7.2.2). 
The  constraints  selected  for  the  objects  include  a  number  of  neighboring  relationships.  For 
the  clique  functions,  we  have  used  again  the  piecewise-linear  basis  functions  for  the  fear 
tures  and  basis  functions  of  (14)  for  spatial  constraints.  In  particular,  the  comer  points  of 
the  basis  function  for  a  given  feature  of  a  given  object  is  determined  from  observing  the 
maximum  and  minimum  of  the  measurements  of  that  feature  on  this  type  of  object  identi¬ 
fied  by  the  human  interpreter.  “Guard  intervals”  are  introduced  around  the  the  maximum 
and  minimum  to  determine  the  exact  values  of  the  four  comer  points.  This,  as  well  as  the 
selection  of  the  weights,  has  been  performed  heuristically,  since  there  are  relatively  few 
objects  of  each  type.  When  the  number  of  objects  of  different  types  is  large,  the  process  of 
determining  the  comer  points  can  be  automated  [50].  In  Table  7.2.4,  we  have  shown  the 
domain  knowledge  learned  from  the  training  data  in  the  form  of  the  basis  functions  from 
which  the  clique  functions  are  constructed.  This  set  of  basis  functions  and  subsequent 
clique  functions  have  been  used  to  obtain  all  the  experimental  results  to  be  described  in 
this  section. 

B.)  Interpretation  Using  Simulated  Annealing : 

In  this  experiment,  interpretation  has  been  performed  on  two  test  images.  The  first 
one  is  the  training  image  itself.  This  is  used  to  verify  the  correctness  and  effectiveness 
of  the  knowledge  obtained  in  the  form  of  clique  functions  from  the  training  stage.  Here 
we  have  performed  interpretation  on  the  training  image,  using  both  manual  segmentation 
and  computer  segmentation,  as  shown  in  Fig.’s  7.2.8  and  7.2.9,  respectively.  Since  the 
computer  segmentation  provides  comparable  quality  to  that  of  the  manual  segmentation, 
the  subsequent  interpretation  results  are  both  quite  good.  Here,  most  of  the  regions  are 
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correctly  identified  except  that  some  of  the  oil  tanks,  which  appear  only  partly  in  the  image, 
are  labeled  as  unknown  objects.  The  reason  for  this  is  that  in  the  knowledge  base,  oil  tanks 
are  characterized  as  circular  objects,  as  reflected  from  the  definition  of  the  compactness  in 
Table  7.2.2  and  the  corner  points  for  the  corresponding  basis  functions  in  Table  7.2.4.  In 
other  words,  oil  tanks  that  are  only  partly  in  the  image  were  treated  as  unknown  objects 
in  the  training  stage  and  hence,  it  is  not  surprising  that  they  have  been  interpreted  as  so 
in  the  interpretation  stage.  To  recognize  these  “partial”  oil  tanks,  more  powerful  shape 
features  should  be  used,  as  described  below. 

The  second  test  image,  as  shown  in  Fig.  7.2.10  a,  is  “cut”  from  a  larger  image 
(2048  x  2048),  from  which  the  training  image  is  obtained,  and  contains  similar  objects  to 
those  in  the  training  image.  This  image  is  used  to  test  the  usefulness  of  the  knowledge 
and  clique  functions  obtained  from  the  training  image.  Notice  that  in  the  original  image 
the  gray  tone  and  texture  of  several  oil  tanks  are  so  close  to  those  of  their  surroundings 
that  they  are  very  hard  to  extract  even  by  human  eyes;  it  is  then  not  surprising  that 
the  computer  segmentation  of  several  regions  corresponding  to  oil  tanks  is  rather  poor, 
as  shown  in  Fig.  7.2.9  c,  especially  in  their  shapes.  As  a  result,  in  the  interpretation 
results  shown  in  Fig.  7.2.10  c,  most  of  the  regions  are  correctly  identified  except  for  these 
oil  tank  regions.  In  fact,  the  compactness  feature  failed  to  be  effective,  since  the  regions 
corresponding  to  the  oil  tanks  have  very  noisy  boundaries  and  some  of  them  are  only 
partially  in  view.  To  solve  this  problem,  Yu  has  proposed  a  more  robust  shape  feature, 
called  partial  compactness,  for  the  oil  tanks  [52].  This  feature  is  based  on  the  idea  of 
obtaining  a  large  number  of  estimates  of  the  radius  of  a  segmented  region  from  random 
points  on  the  boundary  of  the  region,  as  illustrated  in  Fig.  7.2.11.  Here,  we  have  shown 
a  set  of  three  random  points,  A,  B ,  and  C,  on  the  boundary  of  a  segmented  region.  The 
vertical  equal  division  lines  of  AB  and  BC  intersect  at  O,  resulting  in  a  random  estimate 
of  the  radius  r.  In  this  way,  every  set  of  three  random  boundary  points  provides  a  random 
estimate  of  radius.  If  a  region  is  reasonably  circular,  or  partially  circular,  the  variance  of 
the  random  estimates  tends  to  be  small.  It  has  been  shown  by  Yu  [51]  that  this  feature  is 
quite  powerful  for  recognizing  both  circular  and  partially  circular  objects,  and  is  relatively 
robust  to  noisy  boundaries.  An  interpretation  of  the  second  image  is  performed,  again, 
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using  the  knowledge-base  and  clique  functions  of  Table  7.2.4,  except  with  the  compactness 
replaced  by  Yu’s  partial  compactness  (also  shown  in  Table  7.2.2  and  7.2.4).  As  can  be 
seen  from  the  interpretation  results  shown  in  Fig.  7.2.12  c,  improvements  are  obtained 
in  that  all  the  oil  tanks  are  correctly  identified.  This  justifies  the  points  made  in  the 
experiments  on  interpreting  synthetic  images,  that  the  MRF  model-based  formulation  is 
quite  powerful  and  feature  selection  is  the  crucial  problem  in  applying  it  to  real-world 
image  interpretation. 

C.)  Remarks: 

In  concluding  this  section,  we  want  to  point  out  that  purpose  of  the  set  of  sim¬ 
ple  experiments  performed  here  is  to  understand  some  of  the  fundamental  problems  in 
knowledge-based  image  interpretation,  such  as  knowledge  representation,  the  selection  of 
features  and  constraints,  and  the  interaction  between  interpretations  of  different  regions. 
While  this  should  provide  useful  insights  into  how  to  build  practical  systems,  the  knowl¬ 
edge  base  and  clique  functions  used  here  are  not  sufficient  yet  as  a  practical  system.  For 
example,  the  features  used  in  these  experiments  are  neither  meant  to  be  sufficient  nor  the 
best  to  use,  as  would  have  been  done  in  building  practical  systems;  but  rather,  they  are 
used  here  to  demonstrate  useful  concepts  and  important  issues  related  to  the  performance 
of  knowledge-based  image  interpretation  systems.  Finally,  a  practical  system  should  also 
incorporate  the  capability  of  recognizing  complex  objects  composed  of  simpler  ones;  for 
example,  recognizing  a  runway  area  from  road-like  regions  containing  a  number  of  air¬ 
planes.  It  appears  that  the  MRF  model-based  formulation  can  be  use  here  to  construct  a 
hierarchical  representation  for  objects  of  different  complexity.  In  this  representation,  the 
regions  corresponding  to  simple  objects  form  a  low-level  MRF,  while  regions  corresponding 
to  complex  objects  which  are  collections  of  simple  object  regions  form  a  high-level  MRF. 
The  optimal  interpretation  is  the  one  that  maximizes  the  pdf  of  this  hierarchical  MRF.  In 
fact,  a  similar  hierarchical  MRF  model  has  been  used  successfully  for  image  segmentation 
using  the  pyramid  image  structures  [53]. 

7.2.8  Summary  and  Future  Research; 

In  this  paper,  we  have  described  an  MRF  model-based  approach  for  automated  image 
interpretation  demonstrated  as  a  region-based  approach.  In  this  approach,  an  image  is  first 
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segmented  into  a  collection  of  disjoint  regions  of  certain  homogeneous  image  properties. 
These  regions,  together  with  their  spatial  relationships,  form  an  adjacency  graph,  and 
image  interpretation  is  achieved  through  assigning  object  labels,  or  interpretations,  to  the 
regions  using  domain  knowledge  and  measurements  of  features  and  spatial  relationships 
extracted  from  them.  In  this  approach,  the  interpretations  are  modeled  as  a  MRF  on  the 
adjacency  graph  and  the  image  interpretation  problem  is  formulated  as  that  of  finding 
the  best  realization  of  the  MRF  given  the  domain  knowledge  and  measurements.  Through 
the  MRF  model,  this  approach  provides  a  systematic  methodology  for  organizing  and 
representing  domain  knowledge  through  properly  designed  clique  functions  associated  with 
the  pdf  of  the  underlying  MRF,  which  are  to  be  designed  in  such  a  way  that  the  optimal 
interpretation  found  is  most  consistent  with  domain  knowledge  and  measurements.  In 
particular,  we  have  proposed  a  structure  for  the  clique  functions  as  a  weighted  sum  of  basis 
functions  of  features  and  spatial  constraints.  Finally,  the  simulated  annealing  algorithm  is 
used  to  find  the  globally  optimal  interpretation. 

To  study  the  efficacy  of  the  MRF  model-based  approach,  image  interpretation  exper¬ 
iments  have  been  performed  on  both  synthetic  images  and  real-world  aerial  photographs. 
In  the  experiments,  we  have  found  the  piecewise-linear  basis  function  provides  robust  per¬ 
formance  for  the  interpretation  in  the  presence  of  measurement  errors  caused  by  imperfect 
segmentation.  We  have  also  found  that  selecting  powerful  features  and  constraints  are  very 
crucial  to  real-world  image  interpretation;  when  such  features  and  constraints  are  used, 
most  of  the  objects  in  the  image  are  correctly  recognized. 

Although  the  results  here  are  still  preliminary,  they  do  suggest  several  a  promising 
directions  for  future  research  work.  Specifically,  future  research  should  include  the  fol¬ 
lowing  three  immediate  research  tasks.  First  of  all,  a  more  general  approach  is  needed 
to  determine  the  weights  and  corner  points  for  the  piecewise-linear  basis  functions  used 
to  construct  of  the  clique  functions.  Secondly,  work  should  be  done  to  incorporate  other 
primitives,  such  as  linear  edge  segments,  and  high-level  to  low-level  feedback,  such  as  the 
split-and-merge  of  original  segmented  regions  during  interpretation,  into  the  MRF  model- 
based  approach,  as  described  in  Section  II.  Some  preliminary  results  for  these  two  task 
have  already  been  obtained  [27]-[28].  Finally,  the  MRF  model-based  approach  needs  to  be 
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tested  on  more  diverse  real-world  images,  such  as  additional  aerial  photographs. 
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Cable  7.2.1  -Summary  of  Assumed  Knowledge  for  the  Conceptual  Image 

a.  a  single  region 


Region  Knowledge 


Object  Area  Average 

Type  (No.  of  pixels)  Gray  Level 


Car 

S  800 

Sky 

2  25000 

Road 

i  11700 

Field 

2  13500 

c.  three  regions 
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Table  7.2.2  Definitions  of  Several  Region- Based  Features 
a.)  Features  for  a  Single  Region  R 

1.  Area: 

A  =  the  number  of  pixels  in  the  region  R. 


2.  Average  Gray  Level: 

G  =  jr  ^2  *(•»*)» 

«./)€* 

where  x(t,j)’s  are  gray  levels  of  the  pixels  in  R. 

3.  Standard  Deviation  of  Gray  Levels: 


5  = 


2  (s(*\i)-<?)2 


1/2 


4.  Compactness  (Kanai’s): 


where  P  is  the  perimeter  of  R 


P  =  the  number  of  boundary  points  of  R. 


5.  Partial  Compactness  (Yu’s): 


Cp  =  sample  standard  deviation  of  r, 

where  r  is  the  random  measurement  of  the  radius  of  region  R  as  shown  in  Fig.  5.20. 


b.) 

1. 


Features  for  Two  Adjacent  Regions  Ri  and  R}- 
Boundary  Length: 


Pi  +  Pi 

2 


2.  Contrast: 

Cij  =  \a,  -  G,-|. 
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Table  7.2.3  Linear  Basis  Functions  for  the  Synthetic  image 


a.  cliques  of  a  single  node 


Features 

Area 

Average  Gray  Level 

ai.  a2.  bi.  b2.  p 

ai.  a2.  bi.  b2.  p 

750,790.810.850.0.5 

165,175.185,187.0.5 

1 1 500. 1 1 600. 1 1 900.1 2000.0.5 

85.95.105  115.0.5 

19900.20000.65536.65536.0.5 

193.195.205.215.0.5 

13300.13400.13900,14000.0.5 

0.0.55.65.0.5 

b.  cliques  of  two  nodes 


Features 


Boundary  length 


ai,  a2.  bi,  b2,  p 


Contrast 


ai.  a2.  bi,  b2.  p 


85.95.105.115.0.5 


0.0.85.95.0.5 


0.0.55.65.0.5 


car,  road  105.115.125,135,0.5 


road,  field  140.150. 1000. 1000.0.5  35.45.55.65.0.5 


sky.  field  85.95.105.115.0.5  140,150.257,257.0.5 


all  others 


1.0.  1.0 


1.0.  1.0 


0.0.  1.0 


0.0.  1.0 


0.0.  1.0 


0.0.  1.0 


Labels 

Ofc.  p 

sky.  car.  field 

1.0.  i.O 

sky.  car,  road 

1.0.  1.0 

sky,  road,  field 

O 

o 

o 

road.  car.  field 

1.0.  l.o 

all  others 

0.5,  1.0 

Table  7.2.4  Knowing#  ano  Cuoua  functions  tc  ma  A#fi«i  Pnetes 


a  basis  function  to'  ciiouas  of  a  smgia  nooa 


Oil  tank  I  4500  4500  64000  64000 


105.110,140.145.0.5 


185.190.200.205.1.0 


200.205.256.256.0.33 


a.  Satis  function  for  cliouat  of  a  smgia  noda  (cont.) 


0.0.00. 10.0. 10.0. 1.0 


b.  basis  function  for  fsaturss  and  soatiai  constraints 
for  ciiquas  of  two  nodas 


Spatial 

Constraint 


Noda  or  Raglon 

Asaoctacad  Ciiquaa 

Rl 

vRj  > .  (Rr  ,R2; .  >RrRi}.  ;  R^  ,Rj} , 

*  R  ^  ^ ^  1  •  i  Rj » R, • R j / 

R2 

R3 

'R3‘ 

R4 

•:r4> 

c.  cliques  of  the  adjacency  graph 


Figure  7.2.1  A  Synthetic  Conceptual  Image  for  Image  Interpretation. 
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a.  a  Gaussian  function 


b.  Modest  i  no's  f  unct  i  on 
basic  form:  y2/(1+y2) 


r 


c.  ext  ensi  on  t  o  Modest  i  no's  f  unct  i  on 
basic  form:  yn/ (1  +  yn),  n*8 


Figure  7.2.2  Example  of  Different  Basis  Functions. 
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a.  a  single  node  clique 


b.  a  two-node  clique 


d.  a  four-node  clique 


Figure  7.2.3  Illustration  of  All  Possible  Cliques 
for  a  First-Order  Neighborhood  System. 
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a.  the  original  image 


b.  the  segmented  image 


the  interpreted  image 


Figure  7.Z.4  Interpretation  of  the  Ideal  Image 
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a.  the  original  image 


b.  the  segmented  image 


c.  the  interpreted  image  d.  gray  levels:  labels 


Figure 7.2.5  Interpretation  of  the  3dB  SNR  Image 
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a.  the  original  image 


b. 


the  segmented  image 


c.  the  interpreted  image  d.  gray  levels:  labels 


Figure 7.2.6  Interpretation  of  the  Ideal  Image 
with  an  Unknown  Object 
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a.  the  original  image 


b.  the  segmented  image 


c.  the  interpreted  image  d.  gray  levels:  labels 


Figure  7.2.7  Interpretation  of  the  3dB  SNR  Image 
with  an  Unknown  Object 


original  image 


b.  the  segmented  image 


interpreted  image  d.  gray  levels:  labels 


igure  7.2.8  Interpretation  of  the  Training  Image 


the  original  image 


b.  the  segmented  image 


the  interpreted  image  d.  gray  levels:  labels 


Figure7.2.9  Interpretation  of  the  Training  Image 
with  Computer  Segmentation 


a.  the  original  image 


b.  the  segmented  image 


a.  the  original  image 


b.  the  segmented  image 


c.  the  interpreted  image 


d.  gray  levels:  labels 


Figure?. 2.  ^Interpretation  of  the  Test  Image 

with  the  Partial  Compactness  Feature 
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7.3  A  Model-Fitting  Approach  to  Cluster  Validation  with  Application  to 

Stochastic  Model-Baaed  Image  Segmentation: 

7.3.1  Introduction: 

Clustering  procedures  have  found  wide  application  in  statistical  data  analysis  and 
processing.  The  application  of  specific  interest  here  is  stochastic  model-based  image  seg¬ 
mentation  where  a  clustering  algorithm  is  used  to  estimate  the  model  parameters  for  the 
various  image  classes  in  an  observed  image.  In  this,  and  similar  applications,  it’s  gen¬ 
erally  the  case  that  the  clustering  algorithm  requires  prior  knowledge  of  the  number  of 
clusters  or  data  classes.  For  many  applications,  however,  the  number  of  clusters  is  not 
known  a  priori  and  we  would  like  to  determine  it  directly  from  the  data.  This  is  known 
as  the  cluster  validation  problem.  For  stochastic  model-based  image  segmentation,  the 
solution  of  this  problem  directly  affects  the  quality  of  the  segmentation.  In  this  work,  we 
propose  a  model-fitting  approach  to  the  cluster  validation  problem  based  upon  Akaike’s 
Information  Criterion  (AIC).  The  explicit  evaluation  of  the  AIC  is  achieved  through  an 
approximate  maximum-  likelihood  (ML)  estimation  algorithm.  We  demonstrate  the  ef¬ 
ficacy  and  robustness  of  the  proposed  approach  through  experimental  results  for  both 
synthetic  mixture  data,  where  the  number  of  clusters  is  known,  and  to  stochastic  model- 
based  image  segmentation  operating  on  real-world  images,  for  which  the  number  of  clusters 
is  unknown.  This  approach  is  shown  to  correctly  identify  the  known  number  of  clusters  in 
the  synthetically  generated  data  and  to  result  in  good  subjective  segmentations  in  aerial 
photographs. 

7.3.2  Preliminaries: 

Clustering  procedures  are  widely  used  in  various  applications  of  pattern  classification 
and  statistical  data  analysis.  In  a  clustering  procedure,  the  observed  data  or  entities  are 
grouped  together  to  form  a  number  of  clusters  in  such  a  way  that  the  entities  within  a 
cluster  are  more  similar  to  each  other  than  to  those  in  other  clusters.  The  measure  of 
similarity,  usually  heuristically  defined,  is  called  the  cluster  criterion.  For  example,  the 
Euclidian  distance  can  be  used  as  a  similarity  measure  when  the  data  are  finite  dimensional 
vectors. 

For  the  past  three  decades,  many  clustering  algorithms  have  been  developed  by  re- 
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searchers  in  such  diverse  fields  as  biology,  statistical  data  analysis  and  pattern  recognition, 
using  very  different  cluster  criteria  [l].  In  some  previous  work  [2j-[4j  on  stochastic  model- 
based  image  segmentation,  clustering  algorithms  have  been  used  to  estimate  the  model 
parameter  vectors  for  different  image  classes  directly  from  the  observed  image.  Since  the 
nature  of  this  work  is  related  to  statistical  pattern  recognition,  the  clustering  algorithm 
used  was  selected  from  those  developed  within  the  pattern  recognition  community.  One  of 
the  most  successful  clustering  algorithms  in  this  respect  is  the  if-means  algorithm  [5], [6]. 
This  algorithm  is  optimum  in  the  sense  that  it  minimizes  the  variance  within  each  clus¬ 
ter  and  has  been  widely  used  in  unsupervised  pattern  recognition.  However,  an  important 
problem  existing  with  most  clustering  algorithms,  including  the  if -means  algorithm,  is  that 
the  number  of  clusters  in  the  data  must  be  specified  a  priori  before  using  the  clustering 
algorithm. 

In  some  situations  this  number  can  be  derived  from  prior  knowledge  about  the  data, 
or  sometimes  can  even  be  determined  from  visual  inspection  of  the  two-  dimensional  pro¬ 
jection  of  the  data.  However,  in  many  applications,  such  as  our  previous  work  on  image 
segmentation,  it  is  desired  to  estimate  this  number  directly  from  the  observed  data  since 
a  priori  knowledge  is  generally  not  available  and  the  data  are  often  vectors  of  dimension 
higher  than  two  so  that  the  projection  method  is  not  satisfactory.  Furthermore,  even  when 
the  data  is  two  dimensional,  visual  inspection  may  not  be  successful  if  the  data  clusters 
cannot  be  decided  by  observation.  This  problem  is  of  great  practical  importance  for  many 
clustering  algorithms  and  is  known  as  the  cluster  validation  problem  [7].  For  stochastic 
model-based  image  segmentation,  such  as  the  schemes  described  in  [2]-[4],  the  solution  of 
this  problem  directly  affects  the  quality  of  the  resulting  segmentation.  If  the  estimated 
number  of  clusters,  or  image  classes,  is  too  small,  different  homogeneous  regions  in  the  im¬ 
age  will  not  be  well  separated.  Likewise,  if  this  estimated  number  is  too  large,  a  relatively 
homogeneous  region  may  be  separated  into  a  number  of  smaller  regions.  Both  of  these 
situations  are  to  be  avoided. 

There  is  another  class  of  clustering  algorithms,  known  as  hierarchical  techniques,  that 
group  data  in  several  levels  of  nested  partitions  (clusterings).  While  conceptually  they 
do  not  require  a  priori  specification  of  the  number  of  clusters,  in  practice  they  require 
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specification  of  a  "cut”  level  to  choose  a  particular  partition  which  best  represents  the 
number  of  clusters  present.  The  problem  here  in  choosing  a  “cut”  level  is  then  analogous 
to  specifying  the  number  of  clusters  in  the  if-means  type  algorithms  and  the  solution  of 
the  later  extends  easily  to  the  former  as  discussed  in  [21]. 

Most  of  the  previously  proposed  solutions  to  the  cluster  validation  problem  can  be 
classified  into  two  categories:  heuristic  approaches  and  statistical  hypothesis  testing  ap¬ 
proaches.  In  the  heuristic  approach,  the  number  of  clusters  is  determined  by  using  some  ad 
hoc  criteria.  For  example,  for  the  if -means  algorithm  a  typical  approach  is  to  look  at  the 
plot  of  the  average  of  the  variances  within  the  clusters  under  assumptions  of  different  K, 
the  numbers  of  clusters.  The  value  of  K  corresponding  to  the  point  where  the  curve  begins 
to  saturate  can  then  be  taken  as  the  estimated  number  of  classes.  Many  ad  hoc  variations 
of  the  if-means  algorithms  have  been  proposed  based  on  similar  ideas.  In  these  algorithms, 
the  number  K  is  increased  or  decreased  according  to  criteria  such  as  intra-cluster  variance 
and  distance  between  clusters  (e.g.,  the  ISODATA  algorithm  in  [6]). 

An  example  of  more  sophisticated  heuristic  techniques  is  the  bootstrap  scheme  for 
cluster  validation  proposed  recently  by  Jain  and  Moreau  [21].  In  their  approach,  a  number 
of  bootstrap  data  sets  are  first  generated  from  the  original  data  set.  Then,  under  the 
assumption  of  different  number  of  clusters,  the  variance  of  a  heuristically  defined  statistic 
is  computed  on  bootstrap  data  sets.  The  assumed  number  of  clusters  that  results  in 
the  least  variance,  or  believed  to  provide  the  most  stable  clustering,  is  chosen  as  the 
estimate  of  the  number  of  clusters.  Since  different  clustering  procedures  may  give  different 
solutions  using  this  technique,  Jain  and  Moreau  have  also  defined  a  heuristic  index  to 
determine  whether  the  estimate  obtained  from  a  particular  clustering  procedure  is  valid. 
This  approach  has  been  demonstrated  to  provide  effective  identification  of  the  number  of 
clusters  for  synthetic  data  and  real  data  from  simple  range  images.  However,  as  will  be 
discussed  later,  this  method  also  has  some  limitations. 

Several  comprehensive  survey  studies  of  various  heuristic  techniques  can  be  found 
in  [7], [21], [22], [41], [42].  While  some  practical  problems  can  be  solved  using  the  heuristic 
approach,  it  does  not  provide  a  general  solution  to  the  cluster  validation  problem  and,  even 
when  applied  to  specific  problems,  many  techniques  have  to  be  fine-tuned  through  trial- 
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and-error.  This,  in  part,  reflects  the  difficult  nature  of  the  problem.  More  specifically, 
as  pointed  out  by  Everitt  [l]  and  Jain  [7],  clusters  are  generally  very  difficult  to  define 
precisely. 

To  find  generally  applicable  and  mathematically  rigorous  solutions  to  cluster  vali¬ 
dation,  many  researchers  have  tried  to  formulate  the  problem  as  a  statistical  hypothesis 
testing  problem  [8],[9],[41]-[43].  For  example,  hypothesis  tests  have  been  proposed  to  test 
whether  a  given  cluster  should  be  divided  into  two.  More  general  likelihood  tests  have  been 
attempted  with  the  data  modeled  in  terms  of  finite  mixture  distributions  [9].  However, 
due  to  the  structure  of  the  mixture  distribution,  the  parameters,  which  characterize  one 
hypothesis  (for  example,  the  null  hypothesis)  are  at  the  boundary  of  the  parameter  space 
of  the  other  hypothesis.  This,  in  turn,  violates  the  regularity  conditions  (cf.  [9])  which  are 
required  for  the  validity  of  the  asymptotic  distribution  theory  for  the  generalized  likelihood 
ratio  (GLR)  test  which  exists  for  many  simple  hypothesis  testing  situations  where  each 
of  the  hypotheses  can  be  described  as  a  single  probability  distribution.  As  a  result,  no 
generally  applicable  GLR  test  is  available  at  this  point  to  determine  the  number  of  clusters 
directly  from  observation  data. 

On  the  other  hand,  the  problem  we  face  is  not  unlike  the  one  faced  in  developing  a 
theory  to  fit  an  autoregressive  (AR)  model  to  real-world  data  in  which  the  order  of  the 
model  has  to  be  decided  before  the  model  parameters  can  be  estimated  from  the  data. 
Having  observed  that  neither  heuristic  nor  hypothesis  testing  approaches  alone  would 
provide  a  satisfactory  solution  to  determining  the  order  of  the  model,  hence  the  practical 
fitting  of  a  model  to  observation  data,  Akaike  [10]  suggested  that  the  problem  should 
be  viewed  as  a  multiple  decision  problem.  That  is,  rather  than  asking  which  hypothesis 
is  acting  (which  order  is  correct),  we  should  ask  which  model  best  fits  the  data.  The 
goodness  of  fit,  as  pointed  out  later  by  Akaike  [11],  should  be  a  properly  defined  entropy 
function  and  the  best  fit  should  be  obtained  by  maximizing  this  quantity.  Based  on 
this  maximum-entropy  principle,  Akaike  proposed  a  criterion,  generally  called  the  AIC 
( Akaike ’s  Information  Criterion),  to  determine  both  the  order  and  the  parameters  of  an 
AR  model  for  observed  data.  Although  there  have  been  some  criticisms  of  the  AIC  as  being 
inconsistent,  Akaike  showed  that  the  AIC  is  robust  and  optimal  in  a  minimax  sense.  That 
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is,  it  is  optimal  when  there  is  no  a  priori  knowledge  about  the  distribution  of  the  model 
parameters.  In  addition,  Akaike  and  others  also  extended  the  AIC  to  several  Bayesian 
variations  called  the  BIC  (Bayesian  Information  Criterion)[l3],[l4].  This  class  of  criteria 
can  be  shown  to  be  AIC’s  averaged  with  respect  to  various  a  priori  distributions  for  the 
model  parameters.  Although  the  AIC  criterion  and  its  variations  have  achieved  substantial 
success,  mostly  in  AR  model  fitting,  their  application  is,  of  course,  not  limited  to  AR  time 
series  modeling. 

There  has  been  little  previous  work  on  the  application  of  the  AIC  to  cluster  validation. 
Sclove  [17]  demonstrated  a  way  to  use  the  AIC  to  verify  image  segmentation  results.  After 
segmenting  a  synthetic  image  under  the  assumption  of  two  and  three  classes,  the  AIC  was 
used  to  verify  that  the  segmentation  with  three  classes  is  a  better  segmentation. 

In  this  work,  we  have  applied  the  AIC  directly  to  cluster  validation  and  to  stochastic 
model-based  image  segmentation.  Our  study  has  been  conducted  on  both  synthetic  data 
and  real-world  image  data.  Synthetic  data  is  used  to  illustrate  the  efficacy  and  robust¬ 
ness  of  the  AIC.  In  particular,  we  show,  through  experimental  results,  that  the  AIC  is 
effective  in  correctly  identifying  the  number  of  classes  when  the  data  is  generated  from 
well-defined  Generalized  Gaussian  mixtures  even  for  small  inter-cluster  distances,  or  large 
cluster  overlap.  Furthermore,  the  AIC  can  be  made  quite  robust  against  model  mismatch. 
In  the  application  to  image  segmentation,  we  use  the  AIC  to  determine  the  number  of 
distinct  image  classes  modeled  by  parametric  probability  models.  Our  work  is  different 
from  Sclove’s  [17]  in  that  we  apply  the  AIC  explicitly  to  the  cluster  validation  problem  and, 
in  the  application  to  image  segmentation,  we  use  the  AIC  to  decide  the  proper  number  of 
classes  in  an  image  before  segmentation. 

This  paper  is  organized  as  follows.  In  the  next  section,  we  will  formulate  the  cluster 
validation  problem  as  a  mixture  model-fitting  problem  and  describe  how  to  determine  the 
number  of  clusters  by  using  the  AIC.  Then,  in  Section  7.3.4  and  7.3.5,  we  discuss  how  the 
AIC  approach  can  be  applied  to  cluster  validation  and  image  segmentation,  respectively.  In 
Section  8/3/7,  we  demonstrate  some  experimental  results  in  which  the  number  of  clusters, 
or  classes,  is  determined  for  synthetic  mixture  data  and  images  using  the  AIC  criterion. 
We  will  also  illustrate  real-world  image  segmentation  results  obtained  with  the  number  of 
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classes  determined  by  the  A1C.  Finally,  a  summary  and  conclusions  are  provided  in  Section 
7.3.7. 

7.3.3  The  AIC  Criterion  for  Cluster  Validation: 

Basically,  we  want  to  determine  the  number  of  clusters  by  finding  the  best-  fitting 
random  mixture  distribution  model  to  the  data  according  to  the  AIC  criterion.  Suppose 
that  the  sample  data  can  be  represented  by  N  independent  and  identically  distributed 
(i.i.d.)  random  vectors,  Y  =  {y  1,  ya, . . . ,  yjv}<  Furthermore,  assume  that  the  data  samples 
are  to  be  described  by  a  mixture  distribution.  That  is,  for  any  yeY, 

K 

p(y)  =  (i) 

kstl 

where  the  pjt(y)’8  are  individual  mixture-component  or  class  probability  density  functions 
(pdf’s)  with  Xfc,  the  weights,  satisfying 

irk  >  0,  far  k  =  1,2,..., if;  (2a) 

and  x 

=  (2b) 

fcsl 

Now  the  problem  of  cluster  validation  can  be  considered  as  determining  K,  the  number 
of  mixture  components,  from  the  observed  data  where  each  component  corresponds  to  a 
distinct  cluster.  Assuming  that  the  functional  form  of  the  component  pdf’s,  p*(-),  are 
properly  selected,  we  have  formulated  the  problem  of  cluster  validation  as  that  of  finding 
the  best-fitting  mixture  model  for  the  data  Y.  The  goodness-of-fit  employed  here  is  a 
properly  defined  information  theoretic  criterion,  known  as  the  AIC  [10],  defined  by 

AIC(K)  =  —  2/off  {maximum-likelihood  of  the  model  (if)}  +  2  K\  (3a) 

where 

maximum-likelihood  of  the  model  ( K )  =  p(Y|a^)-  (3b) 

Here,  aj^2  is  the  maximum-likelihood  (ML)  estimate  of  the  model  parameter  vector,  a^, 
of  the  mixture  model  with  K  components.  Usually,  contains  the  component  weights 
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in  (2)  and  model  parameters  for  each  of  the  component  pdf’s,  while  K'  in  (3a)  is  the 
number  of  independently  adjustable  parameters  of  the  if -component  mixture  model.  The 
AIC  approach  will  select  the  number  of  clusters  to  be  Kq,  if 

Ko  =  arg  i<;jimn^  AIC{K ),  (4) 

where  K^nx  is  a  pre-specified  upper-limit  for  K.  Using  this  mmtmum  AIC  principle,  we 
can  compute  the  AIC(K)  for  K  —  1,2 and  determine  Kq  according  to  (4). 
This  approach,  while  shown  to  be  a  maximum-entropy  principle  [10], [11],  has  a  simple 
heuristic  appeal.  That  is,  if  two  models  are  about  equally  likely,  the  AIC  will  select  the 
one  with  smaller  number  of  clusters. 

Compared  with  the  previously  proposed  heuristic  techniques,  the  AIC  provides  some 
potential  advantages.  Take  Jain  and  Moreau’s  heuristic  technique  [21]  as  an  example. 
First  of  all,  although  the  heuristics  used  are  quite  effective  in  the  experimental  results 
provided  in  [21],  little  is  known  concerning  how  they  would  work  in  general;  for  example, 
under  some  specific  stochastic  modeling  assumptions  on  the  data.  Secondly,  their  method 
is  nonparametric  while,  in  many  applications,  reasonable  modeling  assumptions  can  often 
be  made  on  the  data.  As  a  result,  parametric  methods  such  as  the  AIC  which  make  use 
of  more  priori  information  may  provide  a  more  precise  description  of  the  data.  Thirdly, 
a  model-based  approach,  such  as  the  AIC,  provides  a  more  unified  and  general  approach 
to  the  cluster  validation  problem.  For  example,  to  achieve  effective  identification  of  the 
number  of  clusters  for  different  types  of  data  sets,  such  as  hyperspherical  mixtures,  circular 
mixtures  or  enlongated  mixtures,  Jain  and  Moreau  employed  different  heuristic  statistics 
as  criteria  for  cluster  validation  on  a  case-by-case  basis  and  it  does  not  seem  clear  how 
this  could  be  done  for  other  types  of  mixtures.  On  the  other  hand,  using  a  model-based 
approach,  such  as  the  AIC  approach,  appropriate  pdf  models  can  be  selected  for  different 
types  of  mixture  data  and  the  criterion,  in  all  cases,  is  based  on  likelihood  functionals. 
Finally,  the  computation  of  the  AIC  is  basically  ML  estimation  which  requires  only  a 
moderate  amount  of  computation  without  using  any  bootstrap  data  sets  which  can  lead 
to  quite  computationally  intensive  procedures. 


To  apply  the  AIC  approach  to  a  given  set  of  random  data,  we  need  to  select  a  mixture 
distribution  model,  estimate  model  parameters  and,  finally,  compute  the  AIC  for  different 
K.  We  discuss  each  of  these  issues  in  turn. 

A.)  The  Generalized  Gaussian  Mixture : 

For  simplicity,  we  further  asstime  the  data  vectors  are  m-dimensional  and  have  condi¬ 
tionally  independent  components  given  the  class.  That  is,  with  y  =  . .  •  ,y^}, 

the  joint  pdf  given  class  k  is  acting  is  described  by 

Pfc(y)  =  k=h2,...,K.  (5) 

»=i 

Then  the  mixture-component,  or  class,  pdf’s  can  be  specified  through  the  component  pdf’s 
of  the  data  vectors.  While  the  AIC  is  applicable  under  quite  general  mixture  models  as  can 
be  seen  from  its  definition  in  (3),  in  this  work  we  have  used  a  particular  class  of  mixture 
models,  known  as  the  generalized  Gaussian  mixture,  to  demonstrate  the  procedure  of 
applying  the  AIC  to  cluster  validation  and  image  segmentation.  Let  y(*),t  =  1,2, ...,m 
be  a  component  of  data  vector  yeY.  Then  a  Gaussian  component  pdf  given  class  k  is 
described  by 


where  mj^ ,  ak'  are  the  mean  and  variance,  respectively.  Similarly,  a  generalized  Gaussian 
componenent  pdf  given  class  k  is  defined  for  a  >  0  by 


Pk](v(i))  =  2$feeXp  ~  >  (7a) 

where  mj^  is  the  mean,  T(*)  is  the  Gamma  function,  and  rj^  is  a  parameter  related  to 
the  variance,  ok’  ,  by 


_(<) _ 1  [r(3/a) 

-  cf  lr(l/a) 


When  a  =  2,  we  have  the  Gaussian  pdf;  when  a  =  1,  we  have  the  Laplacian  or  exponential 
pdf.  Plots  of  a  number  of  typical  parametized  one-  dimensional,  zero-mean,  generalized 
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Gaussian  pdf’s  are  shown  in  Fig.  7.3.1  for  different  values  of  a.  When  a  >  1,  the 
distribution  tends  to  a  uniform  pdf;  when  a  <  1,  the  pdf  tends  to  be  more  peaked  around 
the  mean  and  also  have  heavier  tails.  The  latter  case  results  in  many  “outliers”  associated 
with  the  corresponding  class  or  cluster.  While  the  Gaussian  mixture  has  often  been  used 
to  model  cluster  data,  we  use  the  generalized  Gaussian  model  as  a  convenient  model  for 
studing  outliers  in  clusters  and  evaluating  the  relative  robustness  of  the  AIC. 

B.)  Estimation  of  Model  Parameters: 

After  the  mixture  model  is  selected  for  the  data,  the  computation  in  the  AIC  of 
(3)  mainly  involves  the  ML  estimation  of  the  model  parameters.  The  ML  estimation 
approach  has  been  a  very  successful  method  in  stochastic  model  parameter  estimation 
for  the  pdf’s  which  contain  only  one  component.  Explicit  solution  can  often  be  found 
by  solving  the  appropriate  likelihood  equation  and  the  ML  estimate  in  many  cases  is 
consistent  [18].  Even  in  the  case  where  the  true  distribution  of  the  data  is  not  the  same 
as  the  model,  the  consistency  property  often  still  holds  under  mild  regularity  conditions 
[19].  This  result  is  especially  important  since,  when  we  try  to  use  a  model  to  approximate 
an  unknown  probability  distribution  using  ML  estimation,  it’s  desirable  that  the  estimates 
be  consistent.  Unfortunately,  some  of  these  results  do  not  readily  extend  to  mixture 
distributions  [9].  First  of  all,  explicit  solution  is  impossible  even  for  the  two-component 
case.  Secondly,  the  likelihood  surface  often  has  singularity  points  which  makes  numerical 
solution  difficult.  A  major  reason  for  this  is  that  the  data  is  incomplete  in  the  sense  that 
we  do  not  know  a  priori  to  which  cluster  a  data  vector  belongs.  However,  a  number  of 
approximate  ML  algorithms  for  this  situation  do  exist.  One  of  the  more  popular  methods 
is  the  so-called  EM  (expected  maximum)  algorithm  [9],  [24].  It  has  been  shown  that  under 
mild  regularity  conditions  it  does  provide  local  maxima  that  are  consistent  [24].  However, 
a  disadvantage  of  the  EM  algorithm  is  its  relatively  slow  convergence  [24]. 

In  this  work,  we  use  an  approximate  ML  estimation  scheme  using  a  clustering  algo¬ 
rithm.  First  of  all,  the  if-means  clustering  algorithm  is  applied  to  the  data  to  divide  the 
data  into  K  groups.  Then  each  group  is  assumed  to  correspond  to  the  sample  data  for 
one  and  only  one  mixture  component.  An  ML  estimate  is  then  evaluated  on  each  group 
separately  to  estimate  the  parameters  for  the  corresponding  mixture  component.  Finally, 
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the  component  weights,  k  =  1,2,..., if,  can  be  estimated  as  the  ratio  of  the  number 
of  samples  in  a  group  to  the  total  number  of  samples.  This  approximation  transforms 
the  problem  of  ML  estimation  of  a  mixture  to  that  of  ML  estimation  of  several  individual 
pdf’s.  In  the  next  section,  we  demonstrate,  through  experimental  results,  that  it  provides 
reasonably  good  estimates.  This  scheme  also  converges  fast  since  the  underlying  clustering 
algorithm  is  known  to  possess  fast  convergence  properties. 

There  are  two  points  that  need  to  be  noted  in  using  the  if-means  algorithm  for 
mixture  estimation.  First,  it  has  been  pointed  out  by  Titterington  [23],  among  others,  that 
theoretically,  the  if -means  algorithm  results  in  asymptotically  biased  estimates.  However, 
we  found  that  this  did  not  seem  to  effect  the  performance  of  the  AIC  approach.  As  a  matter 
of  fact,  the  if -means  algorithm  provides  reasonably  good  estimates  and  the  AIC  computed 
using  this  algorithm  is  quite  effective  in  identifying  the  number  of  classes  correctly  in  a 
variety  of  experiments  to  be  described  later.  In  addition,  compared  to  the  EM  algorithm, 
the  if-  means  algorithm  is  computationally  more  efficient.  Hence,  in  the  results  of  this 
paper,  we  have  used  the  if -means  algorithm  exclusively,  although  the  use  of  the  EM 
algorithm  for  the  AIC  is  currently  under  investigation.  The  second  point  is  how  to  chose 
the  initial  cluster  centers,  or  seeds ,  when  using  the  if -means  algorithm  since  the  result  of 
the  clustering,  being  locally  optimum,  often  depends  on  the  choice  of  seeds.  For  example, 
Jain  and  Moreau  [21]  suggest  using  several  different  sets  of  randomly  selected  seeds  when 
the  if -means  algorithm  is  used  for  a  given  specification  of  the  number  of  clusters.  While 
this  improves  the  data  clustering,  or  classification,  the  amount  of  computation  needed  is 
increased  drastically.  In  our  experiments,  we  have  found  that  the  results  of  parameter 
estimation  and  the  subsequent  cluster  identification  using  the  AIC  is  relatively  insensitive 
to  whether  the  K-  means  algorithm  is  used  with  one  set  or  more  than  one  set  of  random 
seeds.  Hence,  in  all  the  experiments,  we  use  only  one  set  of  seeds  selected  from  the 
data  randomly  when  using  the  if -means  algorithm  for  a  given  K.  In  summary,  the  if- 
means  algorithm  is  used  as  a  computationally  efficient  and  reasonably  accurate  estimation 
procedure.  While  it  works  well  for  our  purposes,  we  do  not  claim  it  is  the  best  procedure. 

In  the  above  clustering-estimation  procedure,  once  the  data  are  classifed,  the  model 
parameter  estimation  problem  reduces  to  obtaining  ML  estimates  for  the  parameters  of 
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each  of  the  mixture  components.  For  example,  in  the  Gaussian  mixture  case,  it  is  well- 
known  [18]  that  the  ML  estimate  of  the  mean  is 


where  we  suppress  the  subscript  “ML”  for  convenience,  while  similarly 


J=1 


(8a) 


(8b) 


represents  the  ML  estimate  of  the  variance.  Here,  Nk  is  the  number  of  samples  assigned 
to  class  k,  while  the  y^’s  represent  the  t’th  component  of  the  sample  vectors  assigned  to 
class  k.  The  ML  estimate  of  the  mean  of  the  generalized  Gaussian  is  the  same  as  (8a), 
while  the  ML  estimate  of  the  variance  can  be  conveniently  related  to  the  ML  estimate  of 
*7  through  (7b).  More  specifically,  in  Appendix  A,  we  show  that 


= 


r(3/a) 


r(l/a)J  I  Nk 


3=1 


2/a 


(9) 


which  clearly  reduces  to  (8b)  for  a  =  2. 

C.)  Computation  of  the  AIC: 

There  are  two  applicable  expressions  for  the  likelihood  functional  when  using  the  AIC 
criterion.  If  we  consider  the  data  vectors  to  be  incomplete ,  that  is,  the  class  status  of  the 
samples  is  unknown,  we  will  have  the  standard  likelihood  expression  for  the  mixture  which, 
from  (1),  becomes 


N  JC 

PJc(Y|a)  =  I]  £  **Pfc(y»')-  (10) 

ial  k=  1 

On  the  other  hand,  if  we  first  classify  the  data  by  applying  the  if-means  algorithm,  we  in 
effect  assign  data  vectors  to  hypotheses  classes.  In  this  case  a  data  vector  assigned  to  class 
k  can  be  considered  coming  from  a  particular  class  and  has  a  probability  irk  of  occurring. 
The  corresponding  expression  for  the  likelihood  functional  for  correctly  classified  samples, 
or  complete  data,  then  becomes 
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(11) 


p*(Yia)  =*  n  n  p*(y*>)» 

fc=i  y=i 

where  JV*  <  N  is  the  number  of  samples  in  the  kth  cluster  and  y*y,  j  =  1, 2, Nk  are  data 
vectors  associated  with  this  cluster  [9],  [23].  Since  we  have  used  the  if- means  algorithm 
for  approximate  ML  estimation,  each  sample  vector  is  assigned  to  a  unique  class.  In  what 
follows,  we  will  make  use  of  the  second  likelihood  functional  as  expressed  by  (11).  The  ML 
estimate,  aj^2» to  be  used  in  (3)  in  computing  AIC(K)  is  then  formed  from  the  resulting 
K  class-conditional  parameter  estimates  together  with  the  estimated  weights  as  described 
above. 

Now,  the  application  of  the  above  model-fitting  approach  to  random  data  to  determine 
the  number  of  clusters  is  straightforward ,  as  described  by  the  following  steps: 

1. )  Start  with  K  =  1. 

2. )  For  each  K  =  1,2, ... ,  Kmw ,  compute  AIC(K)  using  the  approximate  ML  esti¬ 

mation  procedure. 

3. )  Announce  that  there  are  Kq  clusters  if  AIC(Kq)  is  the  minimum  among  all 

A/C  (if) ’s  obtained  in  step  2.). 

7.3.5  Application  of  the  AIC  to  Image  Segmentation: 

The  model  fitting  approach  to  cluster  validation  can  also  be  applied  to  stochastic 
model-based  image  segmentation.  In  a  stochastic  model-based  approach,  an  image  is 
segmented  into  regions  of  different  statistical  properties,  or  classes,  that  are  modeled  by 
parametric  random  field  models.  In  [27],  we  have  considered  simple  independent  Gaussian 
random  field  models  for  the  image  classes.  More  specifically,  let  the  image  be  a  two- 
dimensional  (2-D)  array  on  the  lattice  X,  denoted  by  x,  and  described  by 

x  =  {x(i,j),{i,j)eL},  L  =  {(m,n),l  <  m,n  <  N}.  (12) 

Then  an  independent  Gaussian  random  field  can  be  represented  by 

x(i,j)  =  f  +  w(i,j);  (i,j)eL,  (13) 
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where  /  is  a  constant  mean,  w(i,j)  is  a  2-D  zero-mean  Gaussian  white  noise  process  with 
variance  a\.  In  this  paper,  we  will  also  consider  two  other  image  models;  the  AR  model 
and  the  Markov  random  field  (MRF)  model  [28]-[30].  In  particular,  we  consider  a  simple 
AR  model,  known  as  the  first  quarter-plane  filter  (FQPF)  configuration,  represented  by 


x(i,j)  =  ax(i  -  l,j  -  1)  +  bx(i  -  1  ,j)  +  cx{i,j  -  1)  +  w(s,i);  (*,j)eL,  (14) 

where  a,  b,  e  are  constant  model  coefficients,  representing,  respectively,  the  diagonal,  hori¬ 
zontal  and  vertical  correlation  between  pixels,  and  w(i,j)  is  again  a  2-D  zero-mean  white 
Gaussian  noise  process  with  variance  cr*.  The  MRF  model  considered  in  this  paper  is  also 
very  simple,  known  as  the  binary  isotropic  auto  (BIA)  model  [28].  It  is  represented  by  the 
joint  pdf 


p(x)  =  Z~lexp[-U(x )],  (15a) 

where  U  (•)  is  the  Gibbs  energy  functional,  defined  as 

tf(x)  =  a£x*(i,j)  +b£x(i,j)[x(i  -  1  ,j)  +  x(i,j  -  1)],  (15b) 

i,J  *♦/ 

where  the  x(»,j)  terms  are  binary  and  Z  is  a  normalization  factor  to  make  (15a)  a  valid 
pdf.  For  this  model,  a  and  b  are  the  constant  model  parameters  that  control  the  average 
level,  and  the  horizontal  and  vertical  correlation  of  the  pixels,  respectively.  Since  our 
purpose  here  is  to  demonstrate  the  applicability  of  the  AIC  to  different  image  models,  the 
models  selected  are  rather  simple.  However,  the  method  developed  here  is  applicable  to 
more  complex  image  models  as  long  as  they  are  parametric. 

Under  appropriate  stochastic  modeling  assumptions  for  the  different  image  classes,  the 
image  segmentation  problem  can  be  formulated  as  a  statistical  decision  problem,  where 
each  pixel  is  assigned  to  one  of  a  finite  number  of  classes.  Typical  techniques  of  stochastic 
model-based  segmentation  include  ML  and  MAP  segmentation  procedures  [2]-[4],  [3l]-[37]. 
Before  applying  such  a  technique  to  an  image,  however,  the  model  pa  < meters  for  each 
class  have  to  be  estimated.  In  an  unsupervised  approach,  which  is  the  concern  here,  these 
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parameters  are  estimated  directly  from  the  image.  However,  to  estimate  the  model  pa¬ 
rameters,  we  first  need  to  know  how  many  classes  there  are  in  the  image.  The  AlC-based 
model-fitting  approach  described  previously  can  be  applied  directly  in  determining  the 
number  of  classes  in  an  image.  Indeed,  in  this  case,  the  algorithm  outlined  previously  for 
computing  the  AIC’s  can  be  applied  once  the  corresponding  random  data  vectors  are  spec¬ 
ified  and  obtained.  Rather  than  using  all  the  pixels  in  the  image  (there  are  just  too  many), 
the  sample  data  vectors  are  chosen  to  be  the  estimated  model  parameter  vectors  obtained 
from  different  spatial  positions  of  a  sliding  estimation  window  on  the  image,  as  shown 
in  Fig.  7.3.2.  Here,  determine  the  size  of  the  rectangular  window,  while  M2,  iV2 

are  the  vertical  and  horizontal  displacements,  respectively,  between  two  adjacent  spatial 
window  positions.  Hence,  we  perform  cluster  validation  in  the  model  parameter  vector 
space  instead  of  the  pixel  space.  The  number  of  clusters,  or  image  classes,  would  then 
be  the  number  of  distinct  clusters  of  different  model  parameter  vectors.  This  scheme  was 
proposed  in  our  previous  work  [27],  under  the  assumption  that  the  image  classes  are  mod¬ 
eled  as  independent  Gaussian  random  fields;  in  which  case  the  model  parameter  vectors 
are  two-dimensional,  consisting  of  only  the  mean  and  variance  computed  within  a  sliding 
estimation  window  on  the  image.  In  this  paper,  we  will  continue  to  use  this  scheme.  How¬ 
ever,  we  will  also  consider  more  complex  models  for  the  image  classes  such  as  the  AR  and 
the  MRF  models  in  which  case  the  model  vectors  need  no  longer  be  two-dimensional.  For 
example,  the  FQPF  AR  model  has  four-dimensional  model  parameter  vector  (a,6,c,ow); 
the  BIA  MRF  model,  on  the  other  hand,  has  two-dimensional  model  parameter  vector 
(a,  b).  Again,  we  emphasize  that  this  scheme  does  not  have  any  restrictions  on  the  models 
for  the  image  classes,  as  long  as  they  are  parametric.  Once  the  model  parameter  vectors 
are  so  obtained,  the  minimum  AIC  principle  described  above  readily  applies. 

7.3.6  Experimental  Results  and  Discussion; 

In  this  section,  we  will  present  and  discuss  some  experimental  results  on  the  model¬ 
fitting  approach  utilizing  the  AIC  applied  to  synthetic  mixture  data  and  various  stochastic 
and  real-world  image  data.  For  synthetic  data,  we  concentrate  on  the  performance  of  the 
AIC  as  a  cluster  validation  criterion,  e.g.,  its  efficacy  under  different  data  distributions, 
its  performance  under  various  degrees  of  overlap  of  the  clusters  and  its  robustness  under 
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model  mismatch  (in  the  sense  when  the  model  structure  assumed  in  the  AIC  is  different 
from  the  actual  data  distribution).  More  specifically,  we  divide  the  study  on  synthetic  data 
into  three  parts.  In  the  first  part,  we  apply  the  AIC  to  identify  the  number  of  clusters 
in  different  Gaussian  mixtures  where  the  data  clusters  are  relatively  well  separated.  In 
the  second  part,  we  study  the  performance  of  the  AIC  for  generalized  Gaussian  mixtures 
under  varing  degrees  of  cluster  overlap.  In  the  third  part,  we  study  the  relative  performance 
when  the  assumption  of  the  model  structure  in  the  AIC  is  different  from  the  actual  data 
distributions,  again  using  the  generalized  Gaussian  mixtures  for  illustration. 

The  study  of  the  AIC  on  image  data  concerns  the  identification  of  the  number  of 
image  classes  in  synthetic  or  real  images,  where  the  image  classes  are  modeled  as  random 
fields  such  as  Gaussian,  AR  and  MRF’s.  In  this  part,  we  have  used  both  synthetic  images 
and  real-world  images.  The  synthetic  images  are  realizations  of  mixtures  of  the  previously 
mentioned  random  field  models  and  the  number  of  clusters  is  specified  when  they  are 
generated.  In  the  experiments,  these  synthetic  images  are  used  to  demonstrate  the  efficacy 
of  the  AIC  in  identifying  the  number  of  classes  correctly  when  the  number  of  classes 
is  well  defined.  For  the  real-world  images  we  have  used  aerial  photographs.  In  these 
images,  usually  the  true  number  of  classes  is  not  clear  or  well-defined  as  in  the  case  of 
synthetic  images  and  the  AIC  is  used  to  provide  an  objective  way  of  determining  the 
number  of  classes  as  opposed  to  subjective  assessment.  Notice  that  in  this  case,  the  various 
classes  are  statistical  models  which  may  or  may  not  correspond  to  distinct  real-world 
objects.  The  AIC  criterion  is  used  only  to  help  segment  the  image  into  regions  that  are 
statistically  different ,  while  object  recognition  and  global  image  interpretation  are  usually 
performed  subsequently  using  the  segmented  image  together  with  domain  knowledge  and 
measurements  made  on  the  segmented  regions  [37]. 

A.)  Synthetic  Data:  Gaussian  Mixtures: 

In  this  experiment,  three  two-dimensional  (m  =  2)  Gaussian  mixture  data  sets  with 
two,  three  and  four  components,  or  clusters,  are  generated  as  shown  in  Fig.  7.3.3.  We 
choose  the  data  to  be  two-dimensional  since  it’s  then  easy  to  display  on  a  plane.  There  are 
two  objectives  of  this  experiment:  first,  to  see  if  the  approximate  ML  estimates  provide 
a  reasonable  estimate  of  the  true  model  parameters  and,  secondly,  to  see  whether  the 
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AIC  provides  correct  estimates  of  the  number  of  clusters,  even  in  this  idealized  case.  The 
results  of  the  parameter  estimates  for  all  the  test  data  sets  under  the  correct  assumptions 
on  the  number  of  clusters  are  shown  in  Table  7.3.1.  It  can  be  observed  that  when  the 
assumption  of  the  number  of  classes  acting  corresponds  to  the  true  but  unknown  value,  the 
parameter  estimates  are  quite  accurate.  This  is  clearly  the  case  here  for  the  data  clusters 
which  are  reasonably  well-  separated,  and  also  true  for  other  experiments  on  synthetic 
data,  described  below,  where  the  clusters  are  overlaped.  The  if -means  algorithm,  which 
may  not  be  unbiased  theoretically,  does  provide  a  reasonably  effective  and  fast  procedure 
for  mixture  estimation  and  likelihood  computation.  This  indicates  that  the  approximate 
ML  estimation  scheme  using  clustering  is  quite  effective.  In  Table  7.3.2,  we  have  shown 
the  AIC’s  computed  for  all  the  test  data  under  the  assumptions  of  different  numbers  of 
clusters,  with  Kma*  =  8.  We  find  that  the  AIC  does  make  correct  decisions  each  time. 
This  indicates  that  when  the  data  is  indeed  a  Gaussian  mixture  and  the  clusters  are  well- 
separated,  the  method  proposed  here  tends  to  estimate  the  number  of  clusters  correctly. 
Additional  examples  are  given  for  a  much  larger  variety  of  Gaussian  mixtures  in  [19]  with 
similar  results. 

B.)  The  Performance  of  the  AIC  versus  the  Distance  between  Clusters : 

Here,  we  are  interested  in  the  performance  of  the  AIC  on  a  variety  of  data  sets  as  a 
function  of  the  distance  between  clusters;  in  particular,  when  the  distance  is  successively 
reduced.  For  this  purpose,  synthetic  two-dimensional  random  mixture  data  are  generated 
from  a  number  of  generalized  Gaussian  mixture  models.  For  simplicity,  in  the  generalized 
Gaussian  mixture  model  for  each  data  set,  there  are  three  mixture-components  with  equal 
component  probabilities.  That  is,  for  each  data  set,  Ktruth  —  3,  and  jti  =  jt3  =  tt3  = 
1/3.  In  addition,  we  set  the  variances  of  each  mixture-component  to  be  1.0  for  all  the  data 
sets;  that  is,  <7^  =  1.0,  k  =  1,2,3;  i  =  1,2,  for  all  data  sets.  In  Fig.’s  7.3.4-7.3.7, 
we  have  shown  the  16  data  sets  generated  for  this  experiment.  In  each  figure,  there  are 
four  data  sets  generated  with  the  same  a,  but  with  different  distances  between  the  three 
clusters.  Hence,  the  data  sets  in  these  figures  are  different  in  two  ways.  From  one  figure  to 
another,  they  differ  in  a,  the  characteristic  exponent  of  the  underlying  generalized  Gaussian 
model;  within  each  figure,  they  differ  in  d,  the  normalized  distance  between  clusters.  The 
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normalized  distance,  d,  for  a  data  set  is  defined  as 


maxij  D(Ci,Cj) 

sfWj  ’ 

where  c*  is  the  mean,  or  cluster  center,  of  the  fc’th  cluster  given  by 


c*  =  mfc  =  *  =  1,2,3, 


and 


(16a) 


(16b) 


•D(ei.«j)  =  ll'i-'jll  (16c) 

is  the  Euclidian  distance  between  two  cluster  centers.  dss  is  the  average  intrapciuster 
variance,  defined  as 


A:  =  1,2,3. 


(led) 


It  is  used,  in  general,  to  take  into  account  the  spread  of  the  clusters.  For  the  two- 
dimensional  data  sets  considered  here,  m  =  2  and  6*  =  1  for  fc  =  1,2,3.  Finally,  in 
each  data  set,  the  cluster  centers  are  chosen  to  be  on  a  circle  centered  at  (15,15)  with 
radius  r  and  120°  apart.  Hence,  in  all  the  data  sets,  d  =  y/Zr  and  is  proportional  to  r. 

The  AIC’s  computed  under  the  assumption  of  a  generalized  Gaussian  mixture  model 
with  <*o  =  3.0, 2.0, 1.0, 0.5,  respectively,  are  computed  for  each  data  set  in  Fig.’s  7.3.4-7.3.7, 
with  Kmax  =  5.  The  results  are  shown  in  Tables  7.3.3-7.3.G,  respectively.  Examining  the 
resulting  AIC’s,  we  can  see  that  the  AIC  with  different  assumptions  on  a,  denoted  by 
different  ao’s,  correctly  identifies  the  number  of  clusters  for  most  of  the  data  sets  in  Fig.’s 
7.3.4-7.3.7  with  cluster  distances  of  r  =  5.0, 2.5, 2.0;  it  breaks  down  only  for  data  sets  with 
cluster  distance  of  r  =  1.5,  when  the  clusters  are  so  close  that  they  are  hard  to  distinguish 
visually.  Interestingly,  however,  in  the  case  of  r  =  1.5,  when  the  AIC  provides  erroneous 
decisions,  it  generally  suggests  that  there  is  only  one  cluster.  An  exception  is  the  case 
illustrated  in  Table  7.3.6  where  the  data  is  generated  from  a  generalized  Gaussian  mixture 
with  a  =  0.5,  as  in  Fig.  7.3.7.  In  this  case,  the  data  clusters  contain  many  “outliers”; 
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that  is,  data  points  that  belong  to  a  cluster  but  are  relatively  far  from  the  cluster  center. 
The  AIC  computed  based  on  assumptions  of  oto  =  3.0, 2.0,  and  sometimes  1.0,  makes 
erroneous  decisions  even  for  the  data  set  with  r  >  2.0  where  the  clusters  are  visually  quite 
distinguishable.  On  the  other  hand,  the  AIC’s  computed  under  the  assumption  of  ao  =  0.5 
still  provide  correct  decisions,  even  for  a  =  0.5  and  r  =  1.5.  In  the  latter  case,  the  clusters 
are  still  visually  distinguishable,  since  a  =  0.5  represents  a  pdf  very  peaked  around  the 
mean  and  with  heavy  tails. 

From  these  results,  we  can  observe  that  when  the  data  clusters  are  from  the  generalized 
Gaussian  class,  and  they  do  not  contain  many  outliers  (e.g.,  a  >  2),  the  AIC  not  only  makes 
correct  decisions  when  the  clusters  are  far  apart,  it  remains  effective  when  the  clusters  are 
relatively  close,  and  only  breaks  down  when  the  clusters  tend  to  merge  as  one,  in  which 
case  it  reasonably  indicates  that  there  is  one  cluster.  When  the  data  does  contain  many 
outliers  (e.g.,  a  <  1),  the  AIC  computed  based  on  a  heavy-tailed  generalized  Gaussian 
component  pdf  assumptions  (e.g.,  ao  =  0.5)  still  provides  correct  results.  This  brings  up 
the  next  subject  of  discussion. 

C.)  Relative  Robustness : 

In  this  part,  we  first  study  the  robustness  of  the  AIC  under  different  generalized 
Gaussian  modeling  assumptions;  namely,  the  performance  of  the  AIC  when  the  assumed 
value,  ao,  of  the  characteristic  exponent  differs  from  the  true  value,  a.  Some  indication 
of  this  behavior  can  already  be  seen  from  Tables  7. 3. 3-7. 3. 6.  For  example,  taking  the 
Gaussian  assumption  as  an  illustration,  we  can  observe  that  this  choice  (i.e.,  ao  =  2)  is 
quite  robust  as  long  as  there  are  not  too  many  outliers  in  the  data  clusters,  i.e.,  provided 
a  >  1.  It  breaks  down  quickly  when  there  are  too  many  outliers  in  the  data  clusters,  i.e., 
a  <  1.  Similar  observations  can  be  made  for  the  case  of  ao  =  3  and  ao  =  1. 

An  assumption  of  ao  <  1,  on  the  other  hand,  is  quite  robust.  For  example,  the 
assumption  of  ao  =  0.5  provides  very  robust  performance  and  correct  identification  for  all 
the  data  sets  with  r  =  5.0, 2.5, 2.0.  Furthermore,  with  r  =  1.5  it  either  provides  correct 
identification  or  suggests  that  there  is  only  one  cluster,  which  is  quite  reasonable. 

To  compare  the  relative  robustness  of  the  Gaussian  assumption  with  that  of  a  heavy¬ 
tailed  pdf  (e.g.,  ao  <  1),  it  is  also  instructive  to  look  at  the  way  variances  of  the  mixture 
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component  pdf’s  are  estimated  under  different  assumptions  on  ao.  Notice  that,  from 
expression  (8b),  when  there  are  many  outliers,  the  estimate  of  the  variance  under  the 
Gaussian  assumption  will  be  degraded  due  to  the  influence  of  the  outliers  as  a  result  of 
the  squaring  operation  on  the  data  samples.  On  the  other  hand,  from  (9),  when  we  adopt 
an  assumed  generalized  Gaussian  model  with  ao  <  1,  the  effect  of  the  outliers  in  the 
estimate  of  variance  would  be  less  compared  to  the  Gaussian  case  (ao  =  2).  In  Appendix 
B,  we  show  that  the  log  maximum-likelihood  functional  in  the  AIC  of  (3)  for  a  given  data 
set  depends  only  on  the  variances  of  the  mixture  components,  and  the  number  of  data 
points  assigned  to  each  of  the  classes  during  the  ML  estimation  process.  In  particular, 
from  expression  (B8)  of  Appendix  B,  large  intra-cluster  variances  tend  to  reduce  the  log 
likelihood,  or  increase  the  AIC  for  a  given  K,  the  number  of  classes.  Therefore,  when 
there  are  a  considerable  number  of  outliers  in  the  true  data  clusters,  they  will  affect  the 
estimate  of  the  intrarcluster  variances  and  hence  the  computed  AIC  and  the  corresponding 
estimate  of  the  number  of  classes.  Furthermore,  the  outliers  tend  to  affect  AIC(K)  more 
for  smaller  K' s.  Consider  the  case,  for  example,  where  there  are  actually  two  clusters  in 
the  data,  relatively  far  apart  but  each  containing  a  considerable  number  of  outliers.  When 
we  compute  AIC(K)  for  K  =  2  by  the  approximate  ML  algorithm  described  previously, 
the  if -means  algorithm  is  used  first  to  separate  the  data  into  two  clusters.  Suppose  that 
the  separation  is  reasonably  close  to  what  the  two  clusters  should  be.  Then,  in  each  of  the 
clusters,  the  outliers,  being  far  from  the  cluster  center,  would  tend  to  increase  the  estimate 
of  the  variance  of  the  clusters,  hence  increasing  the  AIC.  On  the  other  hand,  when  we 
compute  AIC(K)  for  K  >  2,  the  outliers  in  question  will  be  closer  to  some  cluster  centers 
than  in  the  case  of  K  =  2,  since  the  new  clusters  are  now  formed  from  the  original  two 
data  clusters  in  the  case  of  K  =  2.  As  an  extreme  case,  when  K  is  equal  to  the  number  of 
data  points,  all  the  data  points  will  be  cluster  centers  and  the  intra-cluster  variance  will  be 
zero.  To  summarize,  if  we  compute  the  AIC  under  the  Gaussian  assumption,  the  outliers 
will  tend  to  affect  the  efficacy  of  the  AIC,  and  this  affect  is  more  severe  when  the  actual 
number  of  data  clusters  is  small.  For  other  generalized  Gaussian  assumptions,  the  outliers 
also  affect  the  computed  AIC’s,  and  the  effects  on  assumptions  for  ao  >  1  is  greater  than 
when  ao  <  1.  Next,  in  order  to  provide  more  insight  into  the  robustness  problem,  we  look 
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at  it  in  the  light  of  the  theory  of  robust  statistics. 

Up  to  now,  we  have  used  the  word  robust  in  a  loose  sense.  In  particular,  we  have 
said  that  a  specific  modeling  assumption,  for  example,  a  generalized  Gaussian  mixture 
assumption  (e.g.,  ao  =  0.5),  is  robust  if  the  AIC  computed  based  on  it  performs  well  when 
the  actual  data  is  from  a  different  distribution;  for  example,  a  Gaussian  mixture.  How¬ 
ever,  the  word  "robustness”  has  a  more  strict  meaning  in  the  theory  of  robust  statistics 

[38] , [39].  More  specifically,  an  estimate  of  a  parameter,  such  as  the  mean  or  variance,  com¬ 
puted  under  a  given  modeling  assumption,  is  called  robust  if  it  is  relatively  insensitive  to 
small  deviations  of  the  actual  data  distribution  from  the  assumed  model.  In  our  previous 
examples  using  the  synthetic  data  in  Fig.’s  7. 3.4-7. 3. 7,  the  AIC’s  computed  under  the  gen¬ 
eralized  Gaussian  modeling  assumption  of  oco  =  0.5  are  quite  insensitive  to  the  actual  data 
distribution,  for  example,  for  a  =  3,2,1.  Here,  the  deviation  of  the  model  is  characterized 
by  deviations  in  a,  the  characteristic  exponent  of  the  distribution.  However,  in  the  theory 
of  robust  statistics,  more  general  deviations  from  the  assumed  model  are  considered.  For 
example,  let  the  assumed  model  for  the  observed  random  variable,  X,  be  a  pdf  denoted 
by  /0(x).  A  small  deviation  from  the  assumed  model  can  be  described  as  the  mixture  pdf 

f{x)  =  (1  -  e)/o(x)  +  C0(*),  (17) 

where  g(x)  could  be  any  valid  pdf,  while  e  is  a  small  positive  number  that  describes  the 
fraction  of  gross  error  in  observed  data. 

The  theory  of  robust  statistics  concerns  finding  optimal  robust  estimators.  Two  im¬ 
portant  criteria  for  optimality  are  the  minimax  principle  and  the  gross-error-sensitivity 
criterion.  The  minimax  principle  provides  the  most  efficient  estimate  of  the  given  pa¬ 
rameter  for  the  worst  deviation  from  the  assumed  model  [38].  The  gross-error-sensitivity 
criterion  provides  the  most  efficient  estimate  under  a  given  upper-bound  of  gross-error- 
sensitivity.  It  has  been  shown  that  these  two  approaches  provide  the  same  results  for 
a  number  of  important  problems,  such  as  the  robust  estimation  of  mean  and  variance 

[39] .  In  this  work,  we  use  some  of  the  methods  of  robust  statistics,  in  particular,  methods 
for  robust  variance  estimation,  to  obtain  more  insight  into  the  problem  of  robustness  in 
computing  the  AIC  under  different  modeling  assumptions.  More  specifically,  we  seek  al- 
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ternative  methods  for  computing  the  AIC  such  that  the  efficacy  will  be  preserved  when  the 
data  classes  contains  outliers  and  the  actual  data  distribution  deviates  from  the  modeling 
assumptions.  Comprehensive  treatment  of  the  theory  of  robust  statistics  and  its  various 
applications  can  be  found  in  [38]-[40]. 

In  this  work,  we  take  the  following  approach  for  robust  estimation  of  the  AIC.  We 
assume  the  data  pdf  deviates  slightly  from  a  Gaussian  mixture  due  to  outliers  in  each  of 
the  actual  data  clusters.  More  specifically,  we  assume  that  the  pdf  of  each  of  the  mixture 
components  deviates  from  an  assumed  nominal  Gaussian  pdf.  This  can  also  be  described 
by  expression  (17),  where  /<>(•)  is  the  Gaussian  mixture  component  pdf  and  g(-)  could 
be  any  valid  pdf.  Under  the  Gaussian  assumption,  as  pointed  out  previously,  the  log 
likelihood  of  the  AIC{K ),  for  a  given  K,  depends  only  on  the  intra-  cluster  variances  and 
the  number  of  data  points  in  each  of  the  K  clusters  resulting  from  the  K- -means  clustering 
procedure.  Hence,  we  should  be  able  to  achieve  robust  estimates  for  the  AIC’s  through 
using  robust  estimates  of  the  intrarduster  variances.  In  this  approach,  when  the  AIC  is 
computed,  we  still  use  the  expression  based  on  a  Gaussian  assumption  but  we  replace  the 
variance  estimates  of  (8b)  by  robust  estimates.  Since  we  assume  the  components  of  the 
data  vectors  are  independent,  we  can  look  at  the  robust  variance  estimation  for  individual 
components  of  the  observed  random  data  vectors  of  a  given  class. 

The  approach  we  take  is  based  on  the  so-called  influence  functions  as  treated  in  [39]. 
Consider  the  problem  of  estimating  a  parameter  (e.g.,  mean  or  variance)  of  an  underlying 
distribution  from  a  number  of  i.i.d  observations;  for  example,  the  ML  estimation  problem 
in  (8)-(9)  of  Section  7.3.4.  An  influence  function  is  defined  for  a  random  observation,  a 
parameter,  and  a  pdf.  The  pdf  is  usually  the  ideal,  or  nominal,  model  for  the  random 
variable;  for  example,  /o(-)  in  (17)  from  which  the  real  data  pdf  deviates.  The  influence 
function  describes  the  asymptotic  effect  of  a  contaminated  observation,  or  outlier,  on  the 
estimate  of  the  parameter  (for  example,  the  variance)  for  an  assumed  ideal  pdf  model.  More 
specifically,  let  the  contaminated  observation  be  denoted  by  x,  the  ideal  pdf  be  denoted 
by  /o  and  the  estimate  of  the  statistic  from  a  finite  number  of  samples  by  Tn,  where  n  is 
the  number  of  samples.  Furthermore,  assume  the  estimates  under  consideration  are  Fisher 
consistent  under  the  ideal  model,  i.e.,  when  fo  is  acting,  limn_0o  Tn  =  T  in  probability, 
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where  T  =  T(fo)  can  be  considered  a  functional  of  the  underlying  distribution.  Usually, 
this  functional  can  be  expressed  as  an  expectation  (or  a  function  of  it)  with  respect  to  this 
underlying  distribution.  For  example,  in  the  case  of  the  ML  estimate  of  variance  in  (8b), 
we  have 


r  =  r(Pli,)  =  J*“  SriHvWv 

=  £p”,{yJ}.  (is) 

or,  for  simplicity,  T  =  a .  Then,  in  general,  the  influence  function  can  be  defined  in 
terms  of  x,  T,  and  f0  as  [39] 

INF(x,T,M  =  J^[r((l  -  t)fo  +  «,))]  (19) 

with  T  considered  an  appropriately  defined  functional  of  the  indicated  pdf,  while  8X  is  the 
delta-function  pdf  with  the  unit  probability  mass  concentrated  at  x.  In  addition,  T  must 
also  be  assumed  Fisher  consistent  under  the  mixture  pdf  (1  -  t)f0  +  t6x  for  0  <  t  <  1. 

The  influence  function  provides  insights  into  the  relative  robustness  of  an  estimate  of 
a  parameter.  For  example,  the  influence  function  of  the  ML  estimate  of  variance  of  (8b) 
under  a  zero-mean  unit-variance  Gaussian  assumption  for  /o  can  be  shown  to  be  [39] 

INF(x,T,fo)  =  xa  -  1.  (20) 

From  (20),  we  can  see  that  when  a  contaminated  observation  is  far  from  the  mean  (which  is 
zero  here),  it  has  a  large  effect  on  the  resulting  estimate  of  variance,  T,  due  to  the  squaring 
operation  in  the  ML  estimate  of  variance.  This  agrees  with  the  intuitive  observation  made 
in  previous  sections. 

The  component  variance  is,  in  general,  a  measure  of  the  intra-cluster  spread,  or  dis¬ 
persion.  Often  it  is  possible  to  develop  alternative  measures  of  intra-cluster  spread  which 
are  less  sensitive  to  outliers  and  can  be  utilized  in  place  of  the  ML  estimate  of  variance  in 
computing  the  AIC.  For  example,  suppose  we  measure  the  intra-cluster  spread  in  terms  of 
the  mean  absolute  deviation 
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and  then  square  this  to  obtain  the  statistic  .  In  general,  this  is  a  biased  estimate  of 
the  true  component  variance,  aKk  .  Nevertheless,  it  is  often  the  case  that  asymptotically 
this  statistic  converges  to  a  quantity  proportional  to  the  true  variance  with  the  constant  of 
proportionality  independent  of  the  data.  This  is  the  case,  for  example,  with  the  generalized 
Gaussian  component  distribution.  It  follows  then  from  (B8)  that  asymptotically  the  log- 
likelihood  functional  computed  using  the  statistic  given  by  (21)  is  equivalent  to  that 
using  the  ML  estimate  of  variance.  This  follows  by  arguments  used  in  Appendix  B  in 
demonstrating  the  equivalence  of  (Bll)  and  (B8)  by  eliminating  data  independent  terms. 
This  then  provides  the  rationale  for  using  alternative  statistics,  such  as  given  by  (21),  in 
place  of  ML  estimates.  Furthermore,  this  statistic  has  an  influence  function 

=  (22) 

as  is  shown  in  Appendix  C.  Here,  T  =  where  the  subscript  indicates  the 

expectation  is  performed  with  respect  to  the  pdf  fo-  Compared  to  the  ML  estimate, 
the  effect  of  the  outlier  is  smaller.  Indeed,  it  has  been  shown  [38]  that  (21)  is  a  more 
robust  estimate  of  variance  when  the  real  pdf  deviates  from  the  ideal  Gaussian  assumption 
although  it  is  less  efficient  under  the  ideal  Gaussian  assumption. 

Now,  this  statistic  can  be  extended  to  a  simple  heuristic  method  for  estimation  of  the 
intra-cluster  spread  to  reduce  the  effects  of  the  outliers.  This  estimator  is  given  by 


*W*  - 

ak,h  — 


(23) 


where  0  <  0  <  1,  and  the  subscript  h  indicates  "heuristic”.  The  corresponding  influence 
function  is 


INFh(z,T,M  =  2A(0)(|*|'  -  A(0)), 


(24a) 


where  A(f3)  is  a  constant  for  specified  /?,  given  by 


m = iswmi0'}-  (24b) 

Here,  T  =  A(fi)2.  In  the  next  part,  we  will  show  that  this  heuristic  scheme  provides  some 
interesting  results  for  identifying  the  number  of  classes  in  images.  Finally,  the  influence 
function  for  the  optimal  robust  estimator  (in  the  sense  of  both  minimax  and  gross-error 
criterion)  in  this  case  is 


WF(.,r,/,)=i,-l;  x3  -  1  <  6; 

=  b]  x2-l>b,  (25) 

where  T  —  for  6  =  +oo.  This  results  in  an  estimate  identical  to  the  ML  estimate 
except  that  the  observations  will  be  “clipped”  in  such  a  way  that  in  (8b),  if  the  square 
of  the  observed  value  minus  the  mean  exceeds  6,  the  squared  value  will  be  replaced  by  b. 
.  In  this  scheme,  b  is  a  constant  to  be  determined  from  the  available  information  about  the 
deviation  of  the  model;  for  example,  the  e  in  expression  (17). 

In  Fig.  7.3.8,  we  have  illustrated  several  of  the  influence  functions  discussed  above. 
As  a  comparison,  we  have  also  shown  the  influence  function  for  the  ML  estimate  computed 
under  a  generalized  Gaussian  assumption,  i.e.,  expression  (9)  with  assumed  characteristic 
exponent  ao  =  0.5, 2.0,  when  the  data  distribution  is  described  by  (17)  as  a  deviation  from 
the  Gaussian  assumption.  This  influence  function  can  be  shown,  following  the  approach 
in  Appendix  C,  to  be  given  by 


INF[x,T,  fo)  =  C(a0)INFk(x,T,f0),  (26a) 

where  INFh(x,  T,  /o)  is  the  influence  function  for  the  heuristic  estimator  in  expression  (24) 
with  0  replaced  by  oq,  while  C(a o)  is  a  constant  for  a  specified  ao,  given  by 


C(a o)  = 


T(3/a0)- 

r(l/ao). 


(26b) 
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Here,  T  =  (f?/0{|Y|j^  °})2/ao  and  oto  is  the  assumed  exponent  for  the  generalized  Gaus¬ 
sian  assumption.  For  example,  with  ao  =  1  we  find  C(oq)  =  2.0  indicating  that  the 
influence  function  for  the  corresponding  ML  estimator  increases  twice  as  fast  as  that  for 
the  statistic  of  (21),  i.e.,  the  special  case  of  the  heuristic  estimator  with  /?  =  1.0. 

It  can  be  seen  that,  besides  the  optimal  robust  estimator,  the  heuristic  estimator  is 
likely  to  be  able  to  reduce  the  effects  of  outliers,  while  all  the  estimators  are  better  than 
the  ML  under  the  Gaussian  assumption  as  far  as  combating  outliers  are  concerned^  (their 
influence  functions  increase  slower  than  that  of  the  ML  estimate). 

Finally,  we  want  to  make  two  remarks.  First  of  all,  the  estimators  considered  here  are 
not  all  robust  in  the  strict  sense  of  robust  statistics,  except  the  optimal  robust  estimator, 
in  that  their  influence  functions  are  not  bounded.  However,  the  heuristic  estimator  will 
greatly  reduce  the  effect  of  the  outliers  since  its  influence  function  increases  much  slower 
as  the  magnitude  of  the  observation  increases.  Hence,  this  scheme  may  still  be  effective 
in  practice.  Secondly,  the  heuristic  estimator  may  perform  better  than  the  generalized 
Gaussian  model-based  ML  estimator  when  ao  =  /?,  since  its  influence  function  increases 
slower  than  that  of  the  generalized  Gaussian.  For  example,  we  have  already  seen  that 
when  ao  =  /?  =  1.0,  C(ao),  the  ratio  of  the  two  influence  functions  for  the  generalized 
Gaussian  ML  estimator  of  (9)  and  the  heuristic  estimator  of  (23),  is  2.0;  for  ao  =  /?  —  0.5, 
this  ratio  increases  to  10.1.  This  means  the  influence  function  for  the  heuristic  estimator 
increases  much  slower  in  z  than  that  of  the  ML  estimator  under  the  generalized  Gaussian 
assumption  and  that  this  improvement  is  greater  for  smaller  ao  =  /?.  This  behavior  can 
also  be  seen  from  the  results  in  the  next  part. 

D.)  Application  to  Synthetic  Images'. 

As  described  in  Section  7.3.3,  the  AIC  can  be  applied  to  determine  the  number  of 
stochastic  classes  in  an  image.  This  can  be  achieved  by  finding  the  number  of  clusters 
in  the  sample  model  parameter  vectors  estimated  from  a  sliding  estimation  window  on 

^It  should  be  noted  that  the  curves  in  Fig.  7.3.8  for  the  Gaussian  ML  estimate  and  the 
generalized  Gaussian  ML  estimate  for  ao  =  0.5  eventually  cross  indicating  better  outlier 
rejection  properties  under  this  ao  assumption  for  large  deviations. 
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different  spatial  positions  of  the  image.  When  these  vectors  are  obtained,  the  procedure 
of  identifying  the  number  of  classes  should  be  no  different  from  that  for  the  synthetic 
random  data  described  previously.  In  addition,  the  AlC-based  model-fitting  approach 
does  not  put  any  restriction  on  the  model  for  the  image  classes.  On  the  other  hand,  the 
parameter  vectors  obtained  from  the  sliding  estimation  window  might  not  be  a  Gaussian 
mixture  or  even  a  generalized  Gaussian  mixture.  If  the  image  classes  can  be  modeled  by 
the  independent  Gaussian  assumption,  the  estimated  mean  and  the  estimated  variance  will 
be  approximately  Gaussian.  However,  when  the  image  classes  are  modeled  as  AR  or  MRF 
models,  the  estimated  model  parameters  need  not  be  Gaussian  mixtures.  In  this  case,  the 
robust  estimation  procedures,  described  above,  may  be  useful. 

In  Fig.  7.3.9,  we  have  shown  two  two-class  texture  images  of  size  256  x  256  pixels. 
In  both  cases,  two  textures  are  combined  according  to  the  region  map  shown  in  Fig.  7.3.9 
a.  In  the  first  image  (Fig.  7.3.9  b),  the  textures  are  realizations  of  the  FQPF  AR  random 
field  characterized  by  parameter  vector  =  (a*,6fc,Cfc,oWfc),  A:  =  1,2,  as  described  pre¬ 
viously.  In  this  work,  we  have  chosen  ai  =  (—0.3, 0.7, 0.5, 9.5), a2  =  (-0.3,0.5,0.7,14.1). 
A  common  mean  of  100  is  added  to  both  classes.  In  the  second  image  (Fig.  7.3.9  c), 
the  textures  are  realizations  of  the  BIA  MRF’s,  each  characterized  by  a  parameter  vector 
a*  =  (flfc«6fc),fc  =  1,2.  In  this  work,  we  have  chosen  at  =  (—2.0, 1.0), a2  =  (2.0,— 1.0). 

For  each  image,  the  sample  model  parameter  vectors  are  estimated  from  a  16  x  16 
pixel  sliding  estimation  window  with  horizontal  and  vertical  displacement,  M2,jV2,  also 
both  16  pixels;  hence  there  are  256  sample  moael  parameter  vectors  for  each  image.  The 
sample  model  parameters  for  the  MRF  texture  image,  being  two-dimensional,  are  plotted 
in  Fig.  7.3.10  b.  Since  the  sample  model  parameter  vectors  for  the  AR  textures  are  four¬ 
dimensional,  they  cannot  be  plotted  on  a  plane;  only  the  the  second  and  third  components 
of  these  vectors  are  plotted  in  Fig.  7.3.10  a.  We  can  see  from  Fig.  7.3.10,  the  sample 
parameter  vectors  for  the  MRF  texture  image  do  not  show  a  clear  clustering  tendency;  the 
second  and  third  components  of  the  sample  model  vectors  for  the  AR  textures,  however, 
do  show  a  tendency  of  two  clusters.  In  Tables  7.3.7  and  7.3.8,  we  show  the  computed 
AIC  for  the  sample  model  parameter  vectors  for  both  images  under  the  assumptions  of  a 
generalized  Gaussian  mixture  with  ckq  =  3.0, 2.0, 1.0, 0.5.  The  performance,  as  indicated  in 
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Table  7.3.8  a,  for  the  MRF  textures  under  all  assumptions  are  relatively  poor,  as  expected. 
The  performance  indicated  in  Table  7.3.7  a  for  the  second  and  third  component  of  the  AR 
textures  for  all  the  assumptions  of  ato  is  encouraging;  the  AIC  makes  correct  decisions.  On 
the  other  hand,  the  AIC  computed  for  all  four  components  of  the  model  vectors  of  the  AR 
texture  image,  as  indicated  in  Table  7.3.7  b,  is  not  so  encouraging.  K  —  2  provides  only 
a  local  minimum  of  the  AIC’s  computed.  This  may  due  to  the  fact  that  the  geometrical 
distribution  of  the  model  vectors  with  four  components  in  the  parameter  vector  space  are 
more  complex  than  that  for  the  case  of  two  components  and  cause  more  data  points  to  be 
outliers.  We  want  to  point  out,  however,  the  relatively  poor  performance  such  as  for  the 
MRF  does  not  undermine  the  efficacy  of  the  AIC  but,  rather,  it  points  out  the  need  for 
more  careful  procedures  for  obtaining  the  model  parameter  vectors.  More  specifically,  the 
model  vectors  obtained  from  sliding  estimation  windows  that  contains  contaminated  data 
from  two  texture  classes  should  be  detected  and  rejected  such  that  the  remaining  set  of 
sample  model  parameter  vectors  do  show  a  reasonable  cluster  tendency.  In  addition,  the 
success  of  the  AIC  on  the  second  and  third  component  of  the  AR  model  parameter  vectors 
justifies  the  efficacy  of  the  AIC  in  that  when  the  data  does  show  a  clustering  tendency,  the 
AIC  tends  to  make  correct  decisions. 

An  alternative  for  improving  the  estimates  of  the  sample  model  parameter  vectors 
is  to  use  the  robust  estimator  in  (23)  or  the  optimal  robust  variance  estimator  of  (25). 
The  computed  AIC  using  the  heuristic  estimator  with  /?  =  0.5  and  the  optimal  robust 
estimator  with  oo  =  2  for  both  the  AR  model  parameter  vectors  (in  four  dimensions)  and 
the  MRF  model  parameter  vectors  are  shown  in  Tables  7.3.7  c  and  7.3.8  b,  respectively. 
For  the  optimal  robust  estimator,  the  clipper  threshold,  b,  is  chosen  heuristically  to  be  two 
or  three  times  the  variance  estimated  under  the  Gaussian  assumption  with  two  clusters 
since,  in  this  case,  very  little  is  known  a  priori  about  the  model  deviation  (e.g.,  e).  The 
results  are  also  shown  in  Table’s  7.3.7  c  and  7.3.8  b.  In  this  case,  correct  decisions  are 
made.  However,  we  see  that  for  the  optimal  robust  estimator  to  be  more  practical,  some 
scheme  is  needed  to  determine  the  “clipper  threshold” ,  b. 

E.)  Application  to  Real  Image  Data : 

In  this  experiment,  we  attempt  to  apply  the  method  described  in  the  previous  section 
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to  estimate  the  number  of  statistical  image  classes  in  real  images.  In  particular,  the  images 
are  digitized  aerial  photographs  of  size  256  x  256.  For  simplicity,  we  consider  the  Gaussian 
model  of  (13)  for  the  image  classes  [2].  That  is,  each  image  class  is  modeled  by  an  i.i.d. 
Gaussian  random  field.  Then,  each  image  class  can  be  characterized  in  terms  of  a  model 
parameter  vector  consisting  of  only  two  components;  the  mean  and  variance. 

In  Fig.’s  7.3.11  a  and  7.3.12  a,  we  show  two  aerial  photographs.  The  first  contains  a 
building,  roads  and  vegetation  while  the  second  contains  an  oil  tank  complex  surrounded 
by  vegetation.  The  corresponding  computed  AIC’s  for  different  numbers  of  clusters  are 
shown  in  Table  7.3.9  with  Kmax  =  10.  The  sliding  estimation  window  is  of  size  16  x  16 
pixels  and  the  vertical  and  horizontal  displacements,  M2  and  N 2  in  Fig.  7.3.2,  are  also 
each  16  1  xels.  The  results  suggest  that  in  the  first  image  there  are  four  classes  while  for 
the  second  image  five  classes  best  fits  the  data.  The  images  are  segmented  using  a  ML 
technique  [2]  with  the  corresponding  model  vectors  estimated  according  to  that  suggested 
by  the  AIC  criterion  and  are  shown  in  Fig.’s  7.3.11  and  7.3.12,  along  with  the  original 
images.  In  these  segmentations  different  tonal  areas  are  well  separated.  For  comparison 
purpose  we  have  also  shown  the  results  of  the  segmentation  using  from  two  up  to  six  classes. 
It  can  be  seen  from  the  results  for  both  images  that,  when  the  assumed  number  of  classes 
is  smaller  than  that  determined  by  the  AIC,  a  number  of  significant  regions  of  reasonably 
large  size  are  missing  from  the  segmentation.  On  the  other  hand,  when  the  number  of 
classes  is  larger  than  that  suggested  by  the  AIC,  no  significant  change  in  segmentation 
will  result  from  the  increase  of  the  number  of  classes  except  the  appearance  of  some  noisy 
regions  with  small  size.  This  suggests  that  the  AIC  model-fitting  approach  is  a  reasonable 
objective  approach  for  practical  applications  such  as  image  segmentation. 

7.3.7  Summary! 

In  this  paper  we  described  a  model-fitting  approach  for  determining  the  number  of 
clusters  in  observed  random  data  and  its  applications  to  stochastic  model-  based  image 
segmentation.  The  problem,  also  known  as  cluster  validation,  is  solved  by  finding  a  best¬ 
fitting  mixture  distribution  model  to  the  observed  data.  The  goodness  of  fit  is  determined 
by  the  AIC  criterion.  An  approximate  ML  parameter  estimation  scheme  using  clustering 
is  proposed  to  compute  the  AIC.  Experimental  results  are  also  described  to  demonstrate 
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the  efficacy,  relative  robustness  and  practical  applicability  of  this  method. 

In  the  experiments  involving  synthetic  data,  the  AIC  correctly  determines  the  number 
of  clusters  in  the  mixture  data,  continues  to  correctly  identify  the  number  of  data  clusters 
as  the  distance  between  the  clusters  decreases  and  will,  in  general,  suggest  a  single  cluster 
when  the  distance  is  so  small  that  the  data  clusters  appear  visually  to  be  merged.  We 
have  investigated  the  robustness  of  the  AIC  under  different  modeling  assumptions  using 
the  generalized  Gaussian  mixture  model.  In  particular,  we  have  used  the  generalized 
Gaussian  mixture  both  in  providing  synthetic  data  for  the  experiments  and  as  modeling 
assumptions  in  computing  the  AIC’s.  The  Gaussian  mixture  assumption  is  quite  robust 
when  the  data  clusters  do  not  contain  many  outliers.  A  generalized  Gaussian  assumption 
with  ao  <  1  provides  very  robust  performance  even  when  the  data  clusters  do  contain  many 
outliers.  We  have  also  considered  the  robustness  problem  in  the  lit  of  the  theory  of  robust 
statistics  when  the  actual  data  is  not  a  generalized  Gaussian  mixture,  but  some  deviation 
from  a  nominal  Gaussian  mixture.  This  leads  to  a  heuristic  variance  estimator  and  an 
optimal  robust  variance  estimator  for  computing  the  AIC  when  the  data  pdf  deviates  from 
the  nominal  Gaussian  mixture  assumption.  This  approach  makes  the  performance  of  the 
AIC  more  robust  against  outliers  in  the  data  clusters. 

In  the  application  to  image  data  or  image  segmentation,  we  have  used  the  AIC  to 
identify  the  number  of  image  classes  where  they  are  modeled  as  the  simple  independent 
Gaussian,  or  more  complex  AR  and  MRF  models.  The  AIC  computed  using  more  ro¬ 
bust  variance  estimators,  such  as  the  optimal  robust  variance  estimator  and  the  heuristic 
variance  estimator  correctly  identifies  the  number  of  classes  in  the  synthetic  AR  or  MRF 
mixture  images.  For  real  aerial  photo  images,  where  the  true  number  of  classes  is  unknown, 
the  AIC  computed  based  on  a  Gaussian  mixture  assumption  provides  identification  results 
that  agree  well  with  subjective  assessments. 

This  work  also  brings  up  several  interesting  issues  for  further  investigation.  For  exam¬ 
ple,  it  would  be  of  interest  to  use  the  EM  algorithm  as  the  estimation  method  for  computing 
the  AIC  and  compare  the  results  with  those  described  in  this  paper.  Another  interesting 
issue  is  to  further  understand  the  theoret  ai  aspects  of  the  minimum  AIC  principle;  for 
example,  to  characterize  the  performance  of  the  AIC  under  different  data  distributions  in 
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terms  of  probability  of  correct  decision.  Finally,  additional  study  is  required  on  the  appli¬ 
cation  of  the  AIC  to  image  segmentation.  More  specifically,  estimation  methods  should  be 
developed  to  reject  sample  model  parameter  vectors  from  sliding  estimation  windows  that 
contain  a  significant  amount  of  data  from  more  than  one  image  class;  or  as  an  alternative, 
robust  estimation  techniques,  such  as  the  ones  described  in  this  paper,  need  to  be  fur¬ 
ther  tested  that  will  diminish  the  effects  of  such  ‘‘contaminated”  sample  model  parameter 
vectors. 
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Appendix  A 

Maximum-Likelihood  Estimation  for  Generalized  Gaussian  Models 
In  this  section,  we  will  derive  the  ML  estimate  of  the  variance  of  a  generalized  pdf 
model  from  a  set  of  observations.  For  simplicity,  we  consider  one-dimensional  observations 
and  assume  that  the  mean  of  the  pdf  is  zero.  Let  x  =  {x1,x2, . . .  x//}  (here  x  is  not  used 
for  images  as  in  Section  7.3.3)  be  a  set  of  i.i.d.  observations  of  a  random  variable  X  with 
a  generalized  Gaussian  pdf 


p(l)  =  2? 


(Al) 


where 


1 

V  =  ~ 
a 


(A2) 


T(3/o)1  1/2 

r(i/«)J  ’ 

and  a2  is  the  variance.  Assume  that  a  >  0  is  known;  we  would  like  to  estimate  a2  based  on 
the  observation  x.  From  equation  (A2),  this  is  equivalent  to  estimating  r).  An  ML  estimate 
of  77,  denoted  by  rjML,  is  one  that  maximizes  the  likelihood  functional  with  respect  to  rj 
for  given  x.  That  is 


f)ML  =  axg  maxpx(x|r/),  (A3) 

1 

where 

N 

P*(x|i7)  =  IIp(x<I^)>  (A4) 

»=i 

and  the  p(x«|i7)’s  are  as  in  (Al).  Maximizing  (A3)  is  equivalent  to  maximizing  the  logarithm 
of  (A4),  i.e., 


iog{px{x\r))}  =  log[p{xi  |»y)) 

»=i 

=  n  i^  +  )oa(2r  °/g)) 


(A5) 
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Appendix  B 

The  Expressions  for  the  A1C  Under  Different  Modeling  Assumptions 
In  this  section,  we  will  show  that,  under  the  Gaussian  or  the  generalized  Gaussian 
mixture  assumption  for  the  individual  components,  the  log  maximum  likelihood  in  the 
expression  for  the  AIC  will  only  depend  on  the  the  number  of  samples  in  each  cluster  and 
the  intra-cluster  variances  for  each  cluster. 

We  still  use  the  notation  established  in  Section  7.3.3.  Suppose  that  the  observed 
data  can  be  represented  by  N  i.i.d.  random  vectors,  Y  =  {yi,y2,  •  •  •  ,yw}.  In  [l],  we 
have  shown  for  the  approximate  ML  parameter  estimation  method  which  first  performs  a 
clustering,  the  likelihood  functional  is^ 

Pic 00  =  II  *kh  II  Pk(y *>)•  (Bl) 

fc=i  j-i 

Taking  the  logarithm, 


JC  /  Nk 

log{pK(Y)}  =  £  Nklog*k  +  Yllo3Pk{Yk,) 
fc=i  V 

Assume  the  y’s,  yeY,  are  m-dimensional  with  independent  components,  then 

log{pj c(Y)}  =  Yi  (  Nklo**k  +  £  lo9pk  '(Pfc?)  )  »  (B3) 

\  }=n=i  J 

where  p^(-)  is  the  *’th  component  pdf  of  the  fc’th  class  of  the  random  data.  It  is  easily 
seen  that  the  ML  estimates  (for  convenience,  we  suppress  the  subscript  ML  for  all  the  ML 
estimates)  of  the  mixture  weights,  irk,  ftfe  given  by 

*t  =  ^;*  =  1,2 . K.  (B4) 

Replacing  the  model  parameters  in  (B3)  by  their  ML  estimates  then  yields 

^For  notational  convenience,  we  have  suppressed  the  functional  dependence  upon  a  in 
writing  p*(Y)  for  pfc(Y|a)  as  in  (11). 
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log{p* r(Y)}  = 


(B5) 


where  p^(y^)  is  the  pdf  pj^(yj^)  with  all  parameters  replaced  by  their  corresponding  ML 
estimates.  Now,  we  need  to  find  the  second  term  in  the  brackets  under  different  modeling 
assumptions. 

1.)  The  Gaussian  Mixture: 

In  the  case  of  Gaussian  mixture  with  ML  estimates, 


Then 


(B6) 


J2  £ 

4=1 »=1 

Nk  TO 


(*) 


Nk  TO  r,  /.rn  -II 

~EE  + 


,(»')!  2 


,  TO  n k  to  /  (»)  _  A(*)\2 

=  -\mNtlo92*  -  K'Z'oiW  ~ E  E  (  ? 

»=1  j’=l  »=1  " 

=  -^mNktog2r  -  Nk^loga]^  - 


(B7) 


»=1 


If  we  substitute  (B7)  in  (B5)  and  drop  all  the  terms  that  do  not  depend  on  the  data  or  the 
order,  K,  of  the  mixture  model,  we  will  arrive  at  an  equivalent  log>likelihood  functional, 
or  sufficient  statistic,  given  by 


K  TO 

Lx(Y)  =  Y. Nk(lo,Nk  -  E ’)•  (B8) 

fc=l  »=1 

It  is  indeed  dependent  only  on  the  intra-cluster  variance  and  the  number  of  samples  in 
the  clusters.  This  quantity  can  then  be  used  in  place  of  the  log-likelihood  functional  in 
computing  AIC(K)  according  to  (3). 
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2.)  The  Generalized  Gaussian  Mixture'. 

The  generalized  Gaussian  distribution  has  a  similar  exponential  structure  as  the  Gaus¬ 
sian  distribution.  Hence,  the  derivation  here  is  similar  to  the  previous  part.  More  specifi¬ 
cally, 


5(»)ft/(<h  =  CID 

Pk  Wkj)  2T(l/a)  P 

Then,  similar  to  the  Gaussian  case,  we  have 


(»)\|a 


3=1  i=  1 


=  1 1  (MP  -  -  «  -  *P>r) 


(BIO) 


Substituting  this  in  (B5)  and  dropping  the  terms  that  do  not  depend  on  the  data,  we  have 


K  m  (  \ 

Ik(Y)  =  '£Nk  logNk  -  )  . 

k=l  »= 1  Klk  )  ■ 


(Bll) 


Notice  that  is  proportional  to  the  square  root  of  the  cluster  component  variance 

ajfK  Hence,  (Bll)  is  of  the  same  form  as  (B8)  when  we  again  eliminate  data  independent 


terms. 
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Appendix  C 

Examples  of  Influence  Functions 

In  this  section,  we  will  derive  the  influence  functions  for  the  estimators  outlined  in 
Section  7.3.4-C.  For  simplicity,  assume  that  the  observed  data  are  i.i.d.  random  variables, 
denoted  by  Yi,  Y2, . . . ,  Yn.  Assume  the  ideal  pdf,  /o,  of  the  random  variables  is  zero-mean, 
unit-variance  Gaussian.  Then  the  influence  function  of  the  asymptotic  limit  of  an  estimate, 
denoted  by  T,  is  defined  as  in  (19)  of  Section  7.3.4-C,  by 

INF(z,TJ0)  =  1[T((1  -  t)/o  +  «.)]  |_,  •  (Cl) 

Here,  x  is  a  contaminated  observation,  T  is  considered  as  a  functional  of  the  underlying 
data  distribution,  and  Sx  is  a  delta- function  pdf  at  x.  For  the  examples  we  have  considered 
throughout  this  paper,  Tn ,  the  estimate  of  a  statistic  based  on  n  samples  is  always  related 
to  an  averaging  operator  (e.g.,  (8),  (9),  (18),  (23));  hence,  T  is  always  related  to  an 
expectation  operator.  For  example,  for  the  heuristic  variance  estimator  of  (23)  in  Section 
7.3.4-C, 


*■  »=i 

where  0  <  0  <  1.  In  this  case,  Tn  =  0*,  hence, 


2 

1 


(C2) 


nn=Ej{ \y\’},  tea) 

where  /  is  any  valid  pdf  of  Y,  the  random  variable. 

To  find  the  influence  function  the  heuristic  variance  estimator,  we  can  take  the 
partial  derivative  of  T  in  (C3)  with  respect  to  t,  where  /  =  (1  —  t)fo  +  t6m.  Therefoie, 


|(r((i-i)/o  +  («,)] 

= JiMimH-fwijr} + 

=  2(£/{|K|«»(-£/0{|V|/’}  +  I*!'’)-  (C4) 
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Here,  we  have  used  the  following  property  of  the  expectation  operator 


=  (1  -  *)Eh{Y}  +  tE,,{Y },  (C5a) 

for  pdf’s  fi  and  /a,  and  0  <  t  <  1.  We  have  also  used  the  property  of  the  delta-function 
to  obtain 


Es.{Y)  =  i.  (C5b) 

Finally,  let  t  =  0  in  (C4),  the  influence  function  for  the  heuristic  variance  estimator  is 

!NFh(x,T,M  =  *(S/,{|r|#})(i*i#  -  (C6) 

which  is  the  same  as  (25)  of  Section  HI.  Other  influence  functions  discussed  in  this  paper 
can  be  found  using  the  same  procedure  as  presented  above. 
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Gaussian  mixture 


Gaussian  mixture 


K 

AIC  (K) 

1 

988 

2 

852 

3 

805 

4 

717  (min) 

5 

753 

6 

782 

7 

806 

8 

803 

c.)  Four-component 
Gaussian  mixture 


Table  7.3.2 

Computed  AIC's  for  the  Synthetic  Data  with  K  -8. 

max 
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Number  of 
Clusters 


Assumed  Exponential  Parameter 


Intercluster 

Distance 

K 

o* 

II 

C3 

o 

a0  =  2.0 

a<>  =  1.0 

(To  *  0.5 

■■ 

1 

1383 

1310 

1196 

1105 

**\*r 

1186 

1121 

1033 

972 

669  (min) 

544  (min) 

352  (min) 

205  (min) 

713 

582 

385 

239 

WM 

738 

602 

403 

258 

mm 

i 

820 

719 

563 

442 

2 

887 

773 

601 

472 

3 

661  (min) 

537  (min) 

346  (min) 

201  (min) 

4 

683 

558 

371 

232 

5 

750 

630 

447 

310 

1 

671 

560 

385 

246 

766 

649 

470 

331 

628  (min) 

505  (min) 

316  (min) 

170  (min) 

■■■ 

676 

556 

379 

249 

■ 

681 

562 

385 

252 

i 

508  (min) 

386  (min) 

196  (min) 

49  (min) 

2 

600 

480 

299 

162 

r  =  1.5 

3 

556 

432 

244 

96 

4 

598 

482 

315 

188 

5 

596 

478 

306 

175 

Table  7.3.3 


Computed  AIC's  for  the  Generalized  Gaussian  Mixture  Data  with  a  -  3.0. 


7.3.44 


Number  of 
Clusters 


Assumed  Exponential  Parameter 


Intercluster 

Distance 

K 

ffo  =  3.0 

o* 

ii 

l\3 

o 

a0  =  1.0 

cr0  =  0.5 

mm 

1 

1374 

1300 

1186 

1095 

iHEaf 

1168 

1095 

994 

925 

696  (min) 

542  (min) 

316  (min) 

150  (min) 

■wifi i 

■ 

739 

587 

366 

208 

HHI 

|Q 

776 

619 

400 

247 

■i 

i 

805 

703 

547 

426 

■Hit 

2 

856 

728 

556 

436 

3 

668  (min) 

519  (min) 

298  (min) 

134  (min) 

4 

702 

556 

343 

187 

IH 

5 

744 

599 

390 

239 

1 

656 

542 

369 

236 

2 

780 

650 

464 

327 

r  =  2.0 

3 

625  (min) 

484  (min) 

274  (min) 

117  (min) 

4 

681 

529 

333 

182 

5 

701 

563 

364 

215 

1 

494  (min) 

505  (min) 

177  (min) 

33  (min) 

2 

641 

505 

307 

158 

r  =  1.5 

3 

552 

411 

207 

52 

4 

590 

452 

262 

127 

5 

615 

478 

278 

126 

Table  7.3.4 

Computed  AIC's  for  the  Generalized  Gaussian  Mixture  with  a  ■  2.0 
(the  Gaussian  Mixture). 
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Number  of 
Clusters 


Assumed  Exponential  Parameter 


Intercluster 

Distance 

K 

ff0  *  3.0 

ffo  =  2.0 

S' 

II 

o 

in 

o 

ii 

1 

1376 

1356 

1196 

1113 

1184 

1092 

965 

880 

■bbts 

718  (min) 

494  (min) 

167  (min) 

-56  (min) 

725 

519 

216 

6 

IH 

776 

576 

293 

97 

■■ 

1 

899 

707 

559 

447 

■  | 

2 
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724 

542 

414 

3 

656  (min) 

450  (min) 

146  (min) 

-69  (min) 

4 

664 

473 

193 

-6 

5 

726 

531 

258 

72 

1 

662 

544 

379 

257 

1  1 

742 

604 

410 

271 

■ 

614  (min) 

413  (min) 

117  (min) 

-93  (min) 

635 

442 

164 

56 

wM 1 

692 

499 

235 

56 

1 

506  (min) 

366 

mam 

37 

2 

632 

475 

103 

r  *  1.5 

3 

561 

359  (min) 

-120  (min) 

4 

605 

403 

-56 

5 

652 

456 

I 

-2 

Table  7.3.5 

Computed  AIC's  for  the  Generalized  Gaussian  Mixture  with  a  ■  1.0 
(the  Laplacian  Mixture). 
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Number  of 
Clusters 


Assumed  Exponential  Parameter 


Intercluster 

Distance 

K 

ffo  *  3.0 

ao  =  2.0 

o* 

II 

*o 

CTo  =  0.5 

1 

1 
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1299 

1195 

1120 

i!-v 

■  a 
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914 
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-41  (min) 

694  (min) 
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419 
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mm 
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1  ■■  " 

2 
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329 

3 
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-432  (min) 

1 

4 

633  (min) 

355  (min) 

-66 

-364 

HU 

5 
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-26 

-293 

■ 

668 

533 

370 
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m  ' 

744 

572 

343 

183 

642 

334 

-120 

-433  (min) 

■fl 

556  (min) 

282  (min) 

-123  (min) 

-404 

5 

578 

311 

-72 

-303 

1 

531  (min) 

353 

150 

21 

2 

641 

432 

179 

20 

r  *  1.5 

3 

625 

309 

-126  (min) 

-412  (min) 

4 

619 

333 

-29 

-255 

5 

564 

289  (min) 

-77 

-312 

Table  7.3.6 

Computed  AIC's  for  the  Generalized  Gaussian  Mixture  with  a  -  0.5. 
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o 

CO 

II 

<3 

ffo  =  2.0 

a 

o 

II 

— * 

o 

a0  =  0.5 

-109 

-122  (min) 
-120 
-118 
-116 

-115 

-130  (min) 
-127 
-125 
-123 

-122 

-142  (min) 
-138 
-135 
-134 

-128 

-150  (min) 
-146 
-143 
-142 

a)  AlC’s  computed  for  the  second  and  third  components  of 
the  AR  model  vector. 


a0  =  3.0 

a0  =  2.0 

o 

T— 

II 

o 

ffo  =  0.5 

-155 

-190 

-189 

-191 

-192  (min) 

-166 

-207 

-206 

-208 

-209  (min) 

-181 

-230 

-229 

-231 

-234  (min) 

-191 

-248 

-245 

-247 

-250  (min) 

b)  AlC’s  computed  for  all  four  components  of  the  AR  model 
vectors. 


Heuristic  (/J  =  0.5) 

Robust  (ffo  =  2.0) 

-939 

-1144  (min) 
-1075 
-1057 
-1032 

-1978 

-2179  (min) 
-2149 
-2153 
-2136 

c)  AlC’s  computed  with  the  heuristic  variance  estimator 
with  P  =  0.5  and  the  robust  estimator  with  a0  -  2.0 

Table  7.3.7 

Computed  AIC's  for  the  Two-Class  AR  Texture  Image 


7.3.48 


K 

a0  -  3.0 

a0  =  2.0 

a 

o 

II 

o 

a0  =  0.5 

1 

196 

161 

114 

78 

MB 

-81 

-174 

-304 

-393 

-77 

-175 

310 

-404 

MM 

-151  (min) 

-240  (min) 

-367  (min) 

-457  (min) 

B 

-141 

-232 

-358 

-445 

a)  AlC's  computed  based  on  generalized  Gaussian 
assumptions. 


K 

Heuristic  {fi  =  0.5) 

Robust  (a0  =  2.0) 

1 

41 

-310 

■31 

-106  (min) 

-331  (min) 

Mm 

-62 

-287 

B 

-39 

-290 

B 

-4 

-274 

b)  AlC’s  computed  with  the  heuristic  and  robust  variance 
estimator. 


Table  7.3.8 

AIC's  Computed  for  the  Two-Class  MRF  Texture  Images. 
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Examples  of  Normalized  Generalized  Gaussian  Distribution 


1  I  I  I  •  *  I  •  •  I 

2  4  6  6  10  12  14  16  18  20 

a. I  Two-Component  Gaussian  Mixture 
No.  Of  points  •  500.  m.  *  (4  0.  •»  0). 

Hit  *  (9  0.  9.0).  <7*1  *  CT^j  *  (1  0.  1  0) 


1  i  I  I  I  t  ■  t  i  1 

2  4  6  8  10  12  14  16  18  20 

B.)  Three-Component  Gaussian  Mixture 
No.  of  points  *  500:  m.  *(4  0.  4  0i. 
mt  *  (9  0.  4  0).  mt  (9  0.  9  0). 

cr\  *  <7l  •  <7l  •  (1.0.  1.0). 


2  4  6  8  10  12  14  16  18  20 


C  )  Four-Component  Gaussian  Mixture 
No  of  points  •  500:  m.  •  (4.0.  4  0). 
mi  (9  0.  4  0).  (4  0.  9  0) 

m«  (9  0.  90).  <j\  *  ffi  •  <ri  ‘  ffi  -  (1  0.  10) 

Figure  7.3.3 

Examples  of  Synthetic  Gaussian  Mixture  Data. 
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Influence  Function,  INF(x,  T,  f0) 


-4.0  -2.0  0.0  2.0  4.0 


Observation,  x 

Figure  7.3.8 

Example*  of  Influence  Function*  for  Estimation  of  Intraclas*  Spread 
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a)  The  Two-Class  Region  Map 


b)  The  Two-Class  AR  Texture  Image  c)  The  Two-Class  MRF  Texture  Image 


Figure  7.3.9 

Example  of  Texture  Images 


2.0-1 

1.8 


1.64 


1.44 


1.24 


1.0- 

0.8- 

0.6- 

0.4- 


0.2- 

0.0 


0.2  0.4 


0.6 


^  i  i  i - n 

0.8  1.0  1.2  1.4  1.6 


-i - 1 

1.8  2.0 


a)  The  2nd  and  3rd  Components  of 
the  Sample  AR  Model  Parameter 
Vectors 


b)  The  Sample  MRF  Model  Parameter 
Vectors 

Figure  7.3.10 

Sample  Model  Parameter  Vectors  for  the  Texture  Images 
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c.)  3-Class  Segmentation  d.)  4-Class  Segmentation 

Suggested  by  AIC  criterion 


e.)  5-Class  Segmentation  f.)  6-Class  Segmentation 


Figure  7.3.11 

Segmented  Road  Scene  According  to  the  AIC  criterion. 
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a.)  Original  Image 


b.)  2-Class  Segmentation 


c.)  3-Class  Segmentation  d.)  4-Class  Segmentation 


e.)  5-Class  Segmentation  f.)  6-Class  Segmentation 

Suggested  by  AIC 


Figure  7.3.12 

Segmented  Oil  Tank  Scene  According  to  the  AIC  criterion. 


7.4  The  Photointerpretation  Workstation: 

7.4.1  Introduction: 


The  conceptual  approach  to  automated  photointerpretation  developed  under  this  ef¬ 
fort  has  been  described  in  some  detail  in  previous  sections.  As  mentioned  in  Section  7.1,  we 
have  implemented  this  approach  in  the  TI/Explorer.  This  implementation  makes  use  of  an 
object-oriented  based  menu  system  and  provides  a  fast,  interactive  means  of  performing 
computer  vision  research  based  on  rapid  clique  function  prototyping.  The  TI/Explorer 
implementation  has  been  called  the  Photointerpretation  Workstation.  The  remainder  of 
this  section  is  devoted  to  a  description  of  the  Photointerpretation  Workstation. 

7.4.2  Background 

The  photointerpretation  workstation  is  a  TI/Explorer  based  AI  tool  for  performing 
computer  vision  research  on  continuous-tone  images. 

Our  motivation  for  using  the  Explorer  as  a  computational  platform  is  based,  in  part, 
on  the  severe  limitation  in  the  data  structures  available  in  a  FORTRAN  programming 
environment.  We  have  found  the  traditional  data  processing  environment  to  be  too  re¬ 
strictive. 

Our  objective  is  to  move  to  an  environment  with  a  rich  set  of  data  structures.  With 
the  Explorer  as  our  target  system,  we  hope  to  reap  the  benefits  of  programming  using  an 
object-oriented  paradigm. 

7.4.3  Overview  of  PI  Workstation 

The  Photo  Interpretation  Workstation  (PIWS)  is  based  on  the  TI  Explorer.  The 
PIWS  is  an  interactive,  fast,  object-oriented,  dedicated,  hybrid  test  bed  for  photointer¬ 
pretation.  The  hybrid  nature  of  the  PIWS  comes  from  the  TMS32030  DSP  chip  which 
is  capable  of  20  MFLOP  of  throughput  on  vectorized  operations.  In  particular,  we  feel 
that  this  will  allow  for  fast  image  segmentations.  Because  of  the  proliferation  of  Explorers 
within  the  NAIC,  the  PIWS  should  be  readily  portable. 

7.4.4  User  Interface 

First,  let’s  take  a  look  at  the  programming  environment.  Using  the  flavor  system, 
we  are  able  to  take  advantage  of  the  multiple- inheritance  A.K.O.  (A  Kind  Of)  hierarchy. 
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This  enables  us  to  program  in  a  much  more  flexible  fashion.  For  example,  suppose  that  we 
wanted  to  define  a  flavor,  we  might  use  a  form  similar  to  that  which  is  presented  in  Fig. 

7.4.1. 

Here  we  can  see  that  the  user  has  defined  a  region  flavor  and  made  an  instance  of 
this  flavor,  called  rl.  This  object,  rl,  may  be  sent  a  message,  such  as  :area,  and  its 
area  information  is  then  returned.  If  we  define  a  method,  called  xompactness,  we  may 
send  the  message,  .'compactness  to  the  rl  instance,  just  as  if  it  were  an  instance  variable. 
The  message  is  treated  differently,  however,  since  every  instance  of  the  region  flavor  must 
calculate  its  own  compactness  from  given  instance  variables. 

This  is  a  very  general  approach.  Let’s  look  at  how  this  is  applied  to  the  clique 
function.  Recall  that  a  typical  clique  function  looks  something  like  that  illustrated  in  Fig. 

7.4.2.  With  corner  points  A,B,C,  and  D,  the  clique  function  flavor  might  look  like  Fig. 

7.4.3. 

Here  we  see  that  the  corner  points  become  instance  variables  of  the  clique  flavor. 
The  user  has  created  an  instance  of  the  clique  flavor  and  set  the  atom  cl  equal  to  it. 
It  is  possible  to  then  define  a  method  for  the  clique  function  which  uses  region-based 
calculations  to  design  its  own  corner  points.  In  Fig.  7.4.4,  we  see  an  example  of  the  use  of 
the  clique  flavor. 

Here  we  see  the  true  flexibility  of  the  flavor  system.  We  have  defined  a  label  flavor 
which  contains  a  clique  function  for  each  feature.  We  broadcast  to  a  list  of  regions  the 
message  feature1  and  then  calculate  the  average  of  each-feature  in  the  feature  list.  This  is 
then  returned  as  a  list  of  feature  averages  whenever  the  label  flavor  is  sent  the  :get-features 
message.  This  is  of  great  assistance  when  designing  clique  functions. 

7.4.5  SS&faa 

The  PIWS  does  not  support  image  processing  type  hardware  (yet).  There  is  no  color 
display,  no  continuous- tone  images,  and  no  image  digitization  capability.  Still  we  are  able 
to  display  dithered  images.  For  example,  the  image  in  Fig.  7.4.5  is  displayed  on  a  700  by 
700  pixel  segment  of  the  screen.  The  effective  resolution  is  about  128  by  128  and  up  to 
9  levels  of  grey  are  represented.  Using  this  technique  we  can  also  represent  a  segmented 

1Here  broadcast  is  a  function  which  sends  a  message  to  each  item  in  the  list. 
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image  as  in  Fig.  7.4.6. 

In  Fig.  7.4.7,  we  see  a  print  made  from  the  screen  of  the  PIWS. 

This  PIWS  frame  is  made  up  of  five  panes: 

1.  the  “regions”  pane, 

2.  the  “labels”  pane, 

3.  a  “typeout  window”  pane, 

4.  the  “PIWS  Command  Menu”  pane  and 

5.  a  “status  of  photo”  pane. 

In  the  “regions”  pane  we  can  see  the  name  of  the  region3  followed  by  an  assigned 
interpretation  label.  This  was  assigned  as  a  result  of  the  annealing  process.  The  first-order 
energy  level  for  the  assignment  follows  with  an  overall  energy  for  the  label  assignments  at 
the  bottom  of  the  region  pane.  The  features  for  each  area  are  also  present. 

In  the  labels  pane  we  see  the  clique  functions  with  the  weight  for  each  of  the  assign¬ 
ments.  Both  panes  are  scroll  windows  and  will  scroll  if  the  mouse  is  bumped  up  against 
the  left  hand  side  of  the  pane. 

Each  of  the  regions  in  the  region  pane  is  a  mouse  sensitive  item.  If  the  mouse  comes 
near  any  of  the  items  they  are  highlighted  (this  indicates  that  they  are  mouse  sensitive). 
If  the  mouse  is  cliqued  over  region  v2,  the  display  results  as  in  Fig.  7.4.8. 

Here  we  can  see  that  the  vegetation  label  is  highlighted.  This  indicates  the  present 
computer  interpretation.  All  of  the  items  in  this  pop-up  menu  are  mouse  sensitive  and 
may  be  changed  by  the  user.  A  similar  technique  is  applied  for  the  clique  function  pane, 
as  indicated  in  Fig.  7.4.9. 

7.4.6  Future  Work 

Currently,  we  input  a  symbolic  description  of  the  image.  No  segmentation  is  performed 
on  the  Explorer.  We  would  like  to  perform  all  the  photopreprocessing  on  the  Explorer.  This 
includes  segmentation,  histogram  equalization,  hand  segmentation,  hand  interpretation, 
and  image  display.  We  feel  that  use  of  the  TMS  32020  DSP  will  speed  up  the  segmentation 
of  the  images. 

3Note  these  names  were  assigned  to  a  training  image.  They  are  the  result  of  human 
interpretation. 
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The  Programming  Environment 

•  The  Flavor  System 

Using  the  flavor  system: 

(defflavor  region 
(area  100) 

(• 

(1) 

(setq  rl  (make-instance  'region)) 
(send  rl  :area) 

(defmethod  (region  compactness)  () 

(....)) 

*  This  provides  us  with  a  flexible,  object 
oriented  data  structure. 


Figure  7.4.1 
A  Flavor  Example 


A  Clique  Function 

f(x;  A,B,C,D) 


A  Clique  Function 


The  Clique  Flavor 


(defflavor  clique 
(a  0) 

(b  0) 

(c  0) 

(d  0) 

(• 

•) 

0) 

(setq  cl  (make-instance  ’clique)) 


Figure  7.3.3 
The  Clique  Flavor 


Using  the  Clique  flavor 


(setq  grass  (make-instance  'label)) 

/*  a  human  interpretation  */ 

(send  grass  :add-regions  rl  r2  r3...) 

(send  grass  :add-clique  cl) 

(send  grass  :design-clique-function) 

(defmethod  (label  :design-clique-function) 

(  ) 

(send  self  :get-features) 

( . )) 

(defmethod  (label  :gei-features)  () 

(loop  for  each-feature  in 
feature-list  do 

(send  each-feature  :set-average 
(average 

(broadcast  regions  feature)) 


Figure  7.4.4 
Using  the  Clique  Flavor 


Figure  7.4.5 

A  Dithered  Grey-Tone  Image 
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Figure  7.4.6 
A  Segmented  Image 
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Editing  Features  for  a  Region 
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Editing  a  Clique  Function 
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